A tutorial on variational Bayes for latent linear stochastic time-series models - Supplementary Material 

Dirk Ostwald, Evgeniya Kirilina, Ludger Starke, Félix Blankenburg 

This document comprises the mathematical underpinnings of the gênerai framework laid out in the tutorial's 
main text. It represents an attempt to discuss the variational Bayes approach to latent linear stochastic differential 
équations in their linear Gaussian state space model approximation in a mathematical cohérent form. This project 
started out from a very modest aim: to review the variational Bayesian estimation and évaluation of latent stochastic 
dynamical Systems in its simplest form - the linear case. During the realization of the project, however, it became 
more and more apparent, that, on the one hand, the "mathematization" of this nascent field is still in its infancy, and 
many concepts, theorems, and proofs are only brought under one umbrella by creating their formai links de novo. 
On the other hand, some more established aspects of the framework (such as Kalman-Rauch-Tung-Striebel 
smoothing) are so engrained in the literature, that for many authors their dérivation appears to be a needless 
endeavor. In other words: over the course of its development, the project assumed a more and more "créative" 
rather than "collective" character. 

This cornes at a price: rather than being a collection of well-established and imperméable mathematical 
results, there are many occasions on which the authors had to rely on their personal analytical skills. As Howard 
Raiffa and Robert Schlaifer put it in the "Préface and Introduction" to their Applied Statistical Décision Theory (1961) 
The MIT Press (p.xvii): "The reader [of this supplementary material] may feel that there is an overabundance of 
proofs and he [or she] may well be right in so feeling. We do, however, have two things to say in self-defence. First, 
many of the formulas we give are hard or impossible to find in other books, and being only moderately sure of our 
own algebra and calculus we wanted to make it as easy as possible for the reader to verify our results; corrections 
will be greatly appreciated. Second, as regards the results which are perfectly well known, our désire to produce an 
introduction to [Variational] Bayesian analysis [for latent linear stochastic time-series models] which would be 
accessible even to reader who were not professionally trained in classical statistics [or time-series analysis] seemed 
to us to imply that we should at least give références of the proofs; but when we tried to to so, we quickly came to 
the conclusion that différences both in notation and in point of view (e.g., as regards to compound distributions [!]) 
would make such références virtually incompréhensible except to those readers who had no need of proofs 
anyway." This says it ail, and to emphasize one aspect again: corrections are greatly appreciated, for example, by 
means of email to the corresponding author. For simplicity, we refer to the high-level introductory text (to be) 
published in the Journal of Mathematical Psychology as "the tutorial" in thèse notes, while formally, the tutorial 
proper comprises both the high-level introductory text and the Supplementary Material. 
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1 On Notation 



In this section, conventions for the mathematical notation used in the tutorial and supplementary material 
are discussed. In gênerai, standard mathematical notation is employed, with some concessions to the simplified 
notations commonly used in statistics and machine learning. The gênerai aim of the notation has been the attempt 
to balance mathematical rigor and practical applicability of the presented framework. 

1.1 Sets and Mappings 

For analytical discussions sets Sthat comprise éléments sharing a set-specific property p are denoted in the 

form 

S = {s\s has the property p} (1.1) 

A function (mapping) / is generally specified in the form 

f:D -> R,x h>/(x) (1.2) 

where the set D dénotes the domain of the function /, the set R dénotes the range of the function /, and 
dénotes the mapping of the domain élément x £ D onto the range élément /(x) G R. In concordance with 
contemporary mathematical language, the terms 'function' and 'mapping' are used interchangeably in this tutorial. 
It is important to note that this (mathematical) notation distinguishes between the function / proper, and éléments 
/(x) of its range. Usually in statements as the above, the spécification of the function abbreviation, its domain and 
its range, is followed by a définition of the functional form linking the domain éléments to the range éléments (for 
example for a quadratic function: /(x) := x 2 ). 

1.2 Derivatives and Intégrais 

The derivative of a function of a single variable is usually denoted in 'operator' form as 

where 

^fM\x=a (1-4) 

dénotes the value of the derivative in x = a. The partial derivative of a function of multiple variables x it i = 1, ... , n 
with respect to variable x t is denoted as 

A /:D ^,^A /W (1 . 5) 

Sometimes the abbreviated bar notation for the derivative, /'(x), is also used. 
Intégrais of a function / are usually denoted as 

Ja/(0# (1-6) 

Often, if the domain of intégration is left unspecified, or implicitly understood as comprising the entire real line, the 
intégral boundaries a, b are omitted. Likewise, if the variable of intégration is n-dimensional, the respective n-time 
intégration is left implicit, e.g. for a function /: M 3 -> M, the intégral on ft-J x [a 2 , b 2 ] x [a 3 , b 3 ] c M 3 

L! £ 2 C 3 /(*!' x 2> x -i) dx 1 dx 2 dx 3 (1.7) 
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is usually denoted as 



Sf(x)dx (1.8) 

1.3 Probabilistic concepts 

With respect to probabilistic concepts, the applied probabilistic notation prévalent in physics, machine 
learning, and statistics is employed. Specifically, probability spaces comprising an outcome set fl, a cr-algbra JL and 
a probability measure P, are not explicitly defined. Likewise, random variables will usually not be defined as 
measurable mappings from one measure space to another in the form 

X: (H, JL) -» (0.', Jl'), a) i ^ X{(o) (1.9) 

Rather, the exposition focuses on image measures, or the distributions of random variables, written as P{X) •■= 
Px{X), and generally defined by 

p(X) := P X {X £ A'} •■= P{ù) £ n\X(ù)) £ A'} {A £ JL') (1.10) 

For example, the statement 

p(x = l) = 0.1 (1.11) 

should be read as 'the probability that the random variable x takes on a value of 1 is 0.1'. More generally, arbitrary 
values that a random value may take on are denoted by a superscripted asterisk, e.g. p(x = x*). An explicit 
distinction between probabilistic statements as frequency limits or degrees of beliefs is omitted. In concordance with 
the notations in machine learning textbooks, the explicit differentiation between probability distributions, 
probability density functions, and probability mass functions is omitted and non-capital letters are used to dénote 
random variables and probabilities. Given the focus on Gaussian random variables in this tutorial, ail probability 
distributions employed will actually be probability density functions, and the learning of the parameters of 
probability distributions should be understood as probability density function parameter learning. Finally, while 
some univariate examples are discussed, the theory is in gênerai developed for random vectors, i.e. vectors of 
random variables of the form x = (x 1( ...,x d ) T where the entries x 1( ...,x d represent random variables, and d EN 
the dimensionality of the vector. This treatment subsumes the random variable case with d = 1. The most 
important aspect of this is to keep in mind that the covariances of vectors are not positive scalars as in the random 
variable case, but symmetric positive-semidefinite matrices. Occasionally, colloquial référence will be made to 
"random variables" in cases where the random vector concept is equally appropriate. 

As a notational example, the following conventions for writing bivariate and marginal probability 
distributions of random vectors x and y are used: 

p(x,y):X xy — > R+, (x = x\y = y*) >-» p(x = x\y = y*) (1.12) 

for a bivariate probability distribution, where X and y dénote the ranges of x and y, respectively, and 

p(x):X — > R+, x = x* i-> p(x = x*) = / p(x = x*,y)dy (1.13) 

for the corresponding marginal distribution over x. With respect to conditional probability distributions, denoted 
here as, 

p(x\y):Xxy^R + , (x = x\y = y*) ~ p(x = x*|y = y*) = fgg=g d-14) 

an important notational convention is employed to distinguish between probability distributions parameterized by 
fixed quantifies, i.e. 'nonrandom' variables and probability distributions conditioned on other random variables. 
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Specifically, pg(x) is written for a probability distribution over the random variable x, which is parameterized by the 
fixed (nonrandom) variable 6. Likewise, p$(x\6) is written for a probability distribution that is conditioned on the 
random variable 8 and parameterized by the fixed (nonrandom) variable i9. Finally, in the case of multivariate 
probability distributions, conditional distributions over subsets of variables are often referred to as conditional 
marginal distributions. 

Due to the fact that the parameters of a Gaussian distribution are équivalent with its first central moments, 
namely its expectation and covariance, fréquent use is made of expectation and covariance opérations on random 
vectors. Here, the notations E p ( Z )(/(x)) and (/(x)) P ( X ) are used interchangeably to dénote the expectation 
opération of a function / of the random vector x G X under the probability distribution p: 

(■).:T(x)xP(x) -» R, (p(x),/(x)) ~ </(x)> pW := / f(x)p(x)dx (1.15) 

and 

E.C-):T(x)xP(x) -» R, (p(x),/(x)) ~ E p00 (/(*)) := J f(x)p(x)dx (1.16) 

where T(x) and T(x) dénote unspecified 1 function and probability distribution sets over x and the intégration is 
performed over the range X of x. The reason for the use of thèse interchangeable notations is that (in the opinion of 
the authors) the bracketed notation {f{x)) p ^ emphasizes the 'operator' characteristic of an expectation slightly 
more, and is hence used mainly in algebraic manipulations, while the notation E p ( x )(/(x)) stresses the 'single 
value' characteristic of an expectation or conditional expectation slightly more and is hence used predominantly in 
the statement of results rather than calculations. Covariance opérations are usually denoted by the symbol C in the 
form 

<C(v): P(x,y) -» R+,p(x,y) <C P o, y) (x,y) := E p(xy) ^(x - E p(x) (x)) {y - E p(y) (y)) ^ (1.17) 

which for a random vector of dimensionality d G M refers to a symmetric, positive-semidefinite matrix of 
dimensionality dx d. The probability distribution subscripts in the abbreviations (x) p ^ x - ) ,E p ^(x) and £ p ( A ; j y)(x,y) 
are also often omitted, if the respective distributions is clearfrom the context. 

For random vectors x,y, i.e. vectors of random variables x = (x 1( ...,x d ) T and y = (y lt ...,y d ) T ,d G PJ and scalars 
a,b,c G E, some standard rules for manipulations involving expectations, variances and covariances are listed 
below. 

Expectation of an expectation of a RV 

E(E(x)) = E(x) (1.18) 

Expectation of the product of a scalar and a RV 

E(ax) = aE(x) (1.19) 

Expectation of the sum of two RVs 

E(x + y) = E(x) + E(y) (1.20) 



1 The spaces T and T are left unspecified to eschew a formai discussion of measure theoretic concepts. The same reasoning is 
behind the simplified probability concept notations discussed. 
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Alternative expression for the variance of a RV 



VOO = E(x 2 ) - E(x) 2 



(1.21) 



Alternative expression for the covariance of RVs x,y 



C(x,y) = E(xy) - E(x)E(y) 



(1.21) 



Covariance of a RV x with itself 



<C(x,x) = V(x) 



(1.22) 



Variance of a RV under scalar multiplication and addition 



V(ax + c) = a 2 V(x) 



(1.23) 



Note that (1.18) - (1.23) are readily transferred to the random vector case, if the corresponding rules of vector 
calculus are respected. 

Finally, in the notation of Gaussian distributions the letter \i dénotes the expectation parameter, a and I dénote the 
(co)variance parameter, and A and A dénote précision (inverse variance) parameters. Here the convention is 
followed that matrices are represented by capital letters and scalars by lower case letters. The Gaussian distribution 
probability density function themselves is denoted by N(x; \i, I), where x indicates the Gaussian random vector, and 

H, I its parameters (see Supplément Section 2 for détails). 

I. 4 Statistical concepts 

With respect to the notation of statistical concepts, we note that we will mainly be dealing with itérative 
estimation schemes. We will dénote the value that an estimated parameter takes on the ith itération of some 
itérative parameter estimation scheme is denoted by a superscript (i), such as 8^ l \fj.^ l \A^ l \l.^ . The superscript 
notation thus makes the usual 'hat' notation for parameter estimâtes (6, for example) redundant, which is hence 
omitted. 

Table 1 summarizes the notational conventions used in the tutorial and supplementary material. 
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Notation 


Meaning 


f:D -+R,x »f(x) 


/ dénotes a mapping of éléments x in its domain 
D onto éléments /(x) of its range k. 




— f dénotes the derivative of f, which 
corresponds to a mapping of x onto -^fOO- 


°- f:D ^ RiX „°- m 


For functions of multiple variables x Xl ...,x n (e.g., 
D c M. 71 ), -^-f dénotes the partial derivative of / 
with repsect to Xj. 




Intégral notation a, b indicate the lower and 
upper boundaries and f the intégration variable. 


p(x,y):Xxy -» E + 
(x = x*,y = y*) i-> p(x = x*,y = y*) 


Joint probability distribution notation as prévalent 
in applied mathematical texts. In the context of 
this tutorial, p(x,y) usually refers to a probability 
density function. 


p(x):X — > E+, 
x = x* i-> p(x = x*) := J p(x = x*,y)dy 


Marginal probability distribution as obtained by 
intégration ("marginalization") of continuous 
probability distributions. 


p(x = x*,y = y*) 

(x = x *,y = y*) h» pO = x*|y = y*) := 

J p(x,y = y*)dx 


Définition of conditional probability distibutions. 
The conditional distribution of x given y = y* 
equals the normalized joint probability 
distribution of x and y = y*. 


(|:r(x)x?W -> R 
(p(x),/(x)) h» (/(x)) pW == / f(x)v(x)dx 


Notation of the expectation opération for 
functions / of random variables x. 


E.{-):T(x) xT(x) -» R 
fofx") ffx^ >-> E r ^ff^l := f f fxWx^dx 


Alternative notation of the expectation for 
functions /" of random variables. 


£(;■)■■ Hx.y) -> M + 

/ t\ 
p(x,y) h> C pfey) (%,y) := Ep fey) (ï - E pW (x)) (y - E p(y) (y)J J 


Notation of the covariance for random variables. 


H, a, S, A, 


Symbols for the parameters of Gaussian 
distributions. 




Notation for "conditional parameters." The 
parameter of interest is subscripted with the 
random variable whose distribution it governs (x) 
and which is conditioned on another random 
variable (y). 




Symbols for the value of parameter estimâtes on 
the ith itération of an itérative parameter 
esiirnaiion aigoninrn. 


v s. n 
2, ?■ u 


h is d positi vc-QcTiniic matnx 




The space of k-times continuously differentiable 
functions on R 


M 


Déterminant of the matrix Z 


x ~ p(x) 


x is distributed according to p(x) 



Table 1 Notational conventions of the tutorial and supplément 
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2 Some Properties of Gaussian and Gamma Distributions 

The analytical properties of multivariate Gaussian distributions are of central importance for the VB 
approach for LGSSMs In this section, we summarize some of thèse properties in theorem form. Additionally, an 
overview is provided in Table 2 and Figure 1. 

2.1 Définition and central moments of the Gaussian 

A Gaussian distribution over a random vector, Le., a vector of random variables, x = —,x d ) T £ R d is 
defined by the probability density function 

2VzO) -N(x;n,t) = (27r)-f|2|4exp(-i(x-/i) T I- 1 (x-/i)) (2.1) 

Here, fi £ R d and the positive definite matrix 2 £ R dxd , ! > 0 are the fixed parameters of the distribution as 
indicated by the subscript and semicolon notation in p M ,2:(x) and N(x; fi, 2), respectively. If fi and 2 are themselves 
random variables, the statement above is written p(x\fi, 2) = N(x\fi, 2) and takes on its parameterized form only 
for concrète values of the random variables fi = fi* and 2 = 2*, i.e. p(x\fi = fi*,! = 2*) = N(x; fi*,!*). For the case 
of a univariate random variable x £ R , the probability density function is usually expressed as 

P^,g 2 (*) : = N(x; fi, a 2 ) = (27ra 2 )"2 exp (- ^ (x - //) 2 ) (2.2) 
with fi eR and a 2 £ R + . The first central moment or the expectation of a normal distribution is given by 

i.e., the parameter fi £ R d of a multivariate normal distribution N(x; fi, 2) is équivalent to the expectation of the 
identity function of x under the probability distribution N(x; fi,!.). The second central moment or the covariance of 
a normal distribution is given by 




i.e., the parameter 2 £ R dxd , 2 >■ 0 of a multivariate normal distribution N(x; fi,!) is équivalent to the covariance 
of the identity function of x under the probability distribution N(x;fi,!). Thèse two équivalences allow the 
expectation of the square of x under s (x) to be expressed in terms of the expectation and covariance of x under 
P^,z( x )> an important property, which is exploited on several ocassions in later sections. Specifically, because of 

V(x) = E(x 2 ) - E(x) 2 (2.5) 

one has for normally distributed random vectors x £ R d : 

E p m ,z(*)0* T ) = V P M ,.M<» + E p m , 2 wW E p m , 2 (x)W T (2-6) 
= C P^M(x,x) + E Pm2W (x)E Pmz(x) (x) t 

= {{x - (x) P ^(x)) (x - (x) Pmz(x) ) > Pmz(x) + (x) P ^(x)(x)l flxM 
= ! + fifi T 
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2.2 The précision parameterization 

For practical purposes univariate Gaussian distributions are often parameterized using a précision parameter instead 
of a variance parameter The précision is defined as the inverse of the variance A ■■= 1/a 2 . This yields the équivalent 
notation: 

PM = WOî/U- 1 ) = jïe-^ 2 (2.7) 

2.3 The completing-the-square theorem for the Gaussian distribution 

Thecompleting-the-squaretheorem statesthat ifxe U d ,b £ U d ,A £ R dxd ,A > 0, then 

exp (-\x T Ax + b T x) = Ntx-.A^b.A-^yjWnAY 1 | exp {±b T A^b) (2.8) 

The completing-the-square theorem allows identifying the parameters of a Gaussian probability density function of 
x £ R d based on a quadratic form x £ R d , given by the argument of the exponent on the left hand side of the 
theorem. 

o It is verified by substitution of N(x; A~ 1 b, A -1 ) on the right-hand side of the identity above as follows: 

N(x;A- 1 b,A- i y\(2nA)- 1 | exp {^b T A~^b) (2.9) 

= j l{2nArl 1 exp (--(*- A-HfAix - A-H) + -b T A~ l b) 

VK27M)- 1 | F V 2 V J V J 2 ) 

= exp (- \ (x T Ax - x T AA~ 1 b - b T A~ lT Ax + b T A~ lT AA^b) + \ b T A~ x b) 
= exp (- \ (x T Ax -x T b- b T A~ x Ax + b T A^b) + \ b T A~ x b) 
- exp (--x T Ax + -b T x + -b T x - -^A^b + -b T A~ 1 b) 

*V 2 2 2 2 2 ) 

= exp (—^x T Ax + b T x^ 

n 

2.4 The linear transformation theorem for the Gaussian distribution 

The "linear transformation theorem" states that if x ~ N(x; /i x ,I x ), where x,\i x £ £ R dxd ,I. x > 0, 

£ ~ iV(£; \i £ , I E ), where s, n E £ U. d , I E > 0, <C(x, s) = (<C(£, x)f = 0 £ U. dxd and A £ R dxd is a matrix, then 

Ax + e = :y~N(y; f i y ,ï. y ) (2.10) 

where y £ U d , \i y £ U d , I y £ U dxd , Z y > 0 and specifically, 

fi y = A\l x + n E and I y = AI X A T + S e (2.11) 

The proof that y is indeed a Gaussian random vector capitalizes on the transformation theorem for probability 
density functions and is suppressed here for brevity. 
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o Vérification of équations (2.10) and (2.11) 

If it is acknowledged that y is a Gaussian random vector, the expressions for the parameters of this Gaussian can be 
derived in an instructive manner from the fact that the parameters of a Gaussian correspond to its first two central 
moments and the gênerai rules for manipulating expectancies and covariances as follows: 

\i y := E(y) = E(Ax + s) = E(Ax) + E(e) = AE(x) + E(e) = A[L X + n Ë (2.12) 

and 

Z y =:E((y-E(y))(y-E(y)) T ) (2.13) 
= E {^Ax + e - E(Ax + e))(Ax + £- E(Ax + £)) T ) 
= E y(Ax + £- AE(x) - E(e))(Ax + £- AE(x) - E(») T ) 

= E ((a(x - E(x)) + (e - E(e))) (a(x - E(x)) + (s - E(»)) T 

= E (^A(x - E(x)) + (e - E(e))) ((x - E(x)) T A T + (s - E(») T )) 
= AE ((x - E(x))(x - E(x)) T ) A T + AE ((x - E(x))(e - E(») T ) 

+E ((s - E(£))(x - E(x)) T )^ T + E ((g - E(e))(e - E(») T ) 

= AC(x, x)A T + AC(x, s) + C(e, x)A T + C(e, e) 
= A1 X A T + A • 0 + 0 • A T + S £ 
— i4S^^4 ~\~ 

where in the penultimate line the independence property (C(x, s) := 0 of x and e was exploited. 

□ 

2.5 The marginalization and conditioning theorem for the Gaussian distribution 

The marginalization and conditioning theorem states that, if z~ N(z; fi, I), z, fi £ R d , I £ R dxd , I > 0 and 

z := Q , x £ R m , y £ R d ~ m (2.14) 
with corresponding partitions of the parameters and précision matrix A = I -1 £ R dxd , Le., with 

(Mx\ £ _ /^xx ^xy\ ^ _ l^xx ^xy\ ^ ^5) 



^xx ^xy\ , . I^xx A 

r I ) andA= U A 

j yx yy' v'yx yy 

then 



p(x) = N(x;ii x ,ï. xx ) (2.16) 

Further, 
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p(x\y) = N(x;n x]y ,T. x]y ) (2.17) 

where 

f^x\y = V-x ^xy^yy(y ~ My) anc ' ^x\y = ^xx ~ ^xy^yy^-yx (2-17) 

o Preliminaries 

To verify the marginalization and conditioning theorem, we follow (Bishop, 2007). The expressions for the 
conditional mean and conditional covariance can be verified using two intermediate steps. We first obtain some 
helpful relations between the éléments of matrices I and A. Using the fact that A = I -1 we have: 



(l XX l Xy )( A A X A Xy ) = 7 < 2 - 18 > 



f^xx ^"xy\ l^xx 
'-'yx ^yy' \^yx ^yy> 

Based on (2.18) one can infer the following identifies 

^■xx ^xy^yy^yx ^xx (2-19) 
^•xx ~ ^xy^yy^yx ~ ^xx (2.20) 

and 

^yy^yx ^yx^xx (2.21) 

We next consider the joint probability distribution 

p(x,y) = p(z) =N{z;n,T) (2.22) 

and define 

a ■= (x — fi x ) and b ■= (y — fi y ) (2.23) 
Then the quadratic form in the exponent of p(x,y) can be rewritten as: 

rArr A, 



_i & _ (0 T t - 1& _ (l) ._i (e r **)Q (2 . 24) 

= — ^{a T A xx a + a T A xy b + b T A yx a + b T A yy b) 

= — - (a T A xx a + a T A xy Ay y Ayyb + ô' F A yy A~ y A yx a + b T A yy b) 

y & (j^xx ^xy^yy^yx)^ ^ ^yy^yx^) ^yy(P ^yy^yx^} 



2 V a a - Ay y y y*> J 2 

Further, using the relations (2.18) - (2.21), (2.24) can be rewritten as either 



(2.25) 



or 
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- jO - M) 7 ^ Hz - M) = - j(* - - ^ y 2 y y ^ " %)) A « (* ~ ~ ^ y 2 y y(y ~ My)) ~ \ ~ My/^yy ~ My) 

(2.26) 

□ 

o Vérification of (2.15) 

To obtain the marginal distribution p(x), the joint distribution p(x,y) has to be integrated over y. To this 
end, we make use of (2.25) 

p(x) = Jp(x,y)dy (2.27) 
= (2tt)~2 |l|-2 exp [~\(x - fi x ) T ^(x - nS^j 

■ ! ex P ( - \{y - My - ZyxïxKx ~ Hx)) Ayy (y ~ My ~ ^yx 1 **^ ~ fe))^) dy 

= (2tt)"2|I|-2 e -^-M z ^ (x - fe) (27r)T|A yy | 2 

= N(x; \i x , % X x) 
This provides vérification of the statement (2.15). 



o Vérification of (2.16) 

We now consider the joint probability distribution p(x,y = y*), which, upon normalization, corresponds to 
the conditional distribution p(x|y = y*), because, by définition 

PWy = y*)==^Ç (2.28) 

Rewriting the conditional distribution using quadratic form (2.26) and summarizing ail terms independent of x in a 
constant C, one obtains 

p( x | y = y *) = c • exp (ç\(x - \i x - I xy I yy -(y* - /i y )) A xx (x-^i x - I xy I y ^(y* - fi y )^ (2.29) 
Assuming appropriate normalization we see that conditional distribution has a Gaussian form with parameters 

^x\y* := ^xx = ^xx ~ ^xy^yy^yx (2.30) 

and 

l^x\y* := "I" ^"xy^yyiy ~ My) (2-31) 
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Figure 1. Visualization of properties of the Gaussian distribution. Figure 1A visualizes the completing-the-square 
theorem. The left-hand panel depicts three différent exponential quadratic forms in x with parameters a and b as 
noted in the legend. For the same parameter settings, the right-hand panel depicts the évaluation of the first 
multiplicative term of the right-hand side of the completing-the-square theorem, Le., the Gaussian form with 
parameters \i = aT x b and a 2 = a -1 as dashed curves. The continuous curves represent thèse forms upon 
multiplication with the normalization factor, rendering the ensuing curves équivalent to the exponential quadratic 
forms in x depicted in the left-hand panel. Figure 1B visualizes the linear transformation theorem both on the 
analytical and realized level. The left-hand panel depicts the functional forms of two Gaussian distributions over 
variables x and £ as red and grey continuous curves, respectively. Additionally, a histogram estimate of both density 
functions based on N = 1000 samples is depicted as a bar graph. The right-hand panel depicts the ensuing 
functional form of the distribution of the random variable x + s as well as the resuit of a histogram estimate of the 
addition of the samples obtained for the left-hand panel. Figure 1C visualizes the Gaussian marginalization and 
conditioning theorem for a bivariate Gaussian distribution over z = (x,y) T . The leftmost panel depicts the 
functional form of the joint distribution over x and y. The marginal distributions resulting from the application of the 
marginalization équations are depicted in the second panel from the left. For values y = 1 and y = 2, the third 
panel from the left depicts the conditional marginal distributions p(x|y) as dashed and dotted lines, respectively; 
likewise, for values x = 1 and x = 2, the rightmost panel depicts the conditional marginal distributions p(y|x) as 
dashed and dotted lines, respectively. 
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Définition and notation of the multivariate Gaussian distribution for x £ R d 



P^W == N(x;n,l) = (27r)-f|I|4exp(-i(x " n)TKx - n)) 



Définition and notation of the univariate Gaussian distribution x G R 

P^.o 2 0) : = N(x; ^ ° 2 ) = (27ra 2 )"2 exp (- — (x - fi) 2 J 


Equivalent précision parameterization of the univariate Gaussian distribution x G M 


Expectation and covariance equalities and notation for multivariate Gaussian distributions 


)- 


General variance property and application to the multivariate Gaussian distribution 

V(x) = E(x 2 ) - E(x) 2 => E Piix{x) (xx T ) = <Cp M5;(x) (x,x) + E p ^ x{x) (x)E p ^ {x) (x) T = 




"Completing-The-Square Theorem" 

The completing-the-square theorem states that if x G R d , b G R d , A G R dxd , A > 0 then 
exp (-^x T Ax - b T x) = N(xM- 1 6,^- 1 )7l(27r^)- 1 | exp {^b T A^b) 


"Linear Transformation Theorem" 

The "linear transformation theorem" states that if x ~ N(x;/i x ,2 x ), where ,\i x G 
E dxd ^ > 0£ ~N(£;/i £ ,I £ ), 

where £, /i £ G M' 1 , I E >- 0, <C(x, s) = (<C(>, x)f = 0 G E dxd and ^ G R dxd is a matrix then 

Ax + s = :y ~ N(y;[i y ,I y ) 
where y G R d , [i y G R d , I y G R dxd , I y and specifically 

[i y = ApL x + \i £ and I y = ^4I X /1 T + I £ 


M d , I x G 



"Marginalization and Conditioning Theorem" 

The marginalization and conditioning theorem states that if z~ N(z; fi, S),z,jU G E d ,2 G M dxci ,I > 
Oand 

z := Q,x G M m ,y G R d ~ m 
with corresponding partitions of the parameters and variance matrix I G R dxd , Le., with 



(^x\ , „ f^xx ^xy\ 

then 

p(x) = iV(x; and p(y) = N(y;[i y ,I yy ) 

Further, 

p(x|y) = iV(x; ^ly, Z X | y ), where n x \ y = \i x + I xy I yy -(y - fi y ) and Z X | y = I xx - I zy 2 y ^2 xy 
Table 2. Some Properties of the Gaussian distribution. 
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2.6 Information theoretic quantifies for Gaussian probability density functions 

The differential entropy of random variable x governed by a probability density function p(x) is defined as (Cover 
and Thomas (2006), p. 243) 

H (p(x)) == - J p(x) ln(p(x)) dx (2.32) 

Without proof, we note that for a d-dimensional random vector x £ R d governed by the d-dimensional Gaussian 
distribution N(x; \i, I), this intégral évaluâtes to(Cover and Thomas (2006), p. 250) 

W(N(x;n,T.)) = iln |S| + \ (1 + ln 2n) = iln(Oe) d |Z|) (2.33) 

For the spécial univariate case £ R , the above simplifies to (Cover and Thomas (2006), p. 244) 

H (iV(x; [i, a 2 )) = ^ln a 2 + \ (1 + ln 2tt) = iln(27rea 2 ) (2.34) 

As noted in Appendix A of the tutorial, the Kullback-Leibler divergence from probability density function q(x) over a 
random variable x to a probability density function p(x) over the same random variable is defined as 

X£(q(x)||p(x)) == fq(x)lngg)dx (2.35) 

Without proof (Penny, 2001), we note that the KL divergence between two Gaussian probability density functions on 
a d-dimensional random vector x £ R d , Af(x; [i qi I q ) and N(x; fi p , I p ), is given by 

KL (jV(x; fi q , S q ) | \N(x; H , S p )) = ^ + \ trfe%) + \ (n q - ^fz' 1 (n q - n p ) - \ (2.36) 

where tr(A) = Y. d =i a u dénotes the trace of a matrix A £ R dxd . For the spécial univariate case £ R , the above 
simplifies to 

XL (s ( X ; $ | = ^ + ±^2!û - i ,, 37) 

2.7 The Gamma distribution 

The Gamma density is considered in its "shape and scale" parameterization given as 

G(X;a,b) := p^^^ a_1 ex P (~ ^) forA > 0 and a,b > 0 (2.38) 

where a is referred to as the "shape parameter" and b is referred to as the "scale parameter". T(x) dénotes the 
Gamma function which is defined for x > 0 as: 

T: E + \{0} ->i,xi-> r(x) = J 0 °° t x - x e~ l dt (2.39) 

The expectation and variance of A under G(A; a, b) are expressed in terms of the parameters as 

^G(A;a,b)W = ab and V G(A;ai6) (A) = ab 2 (2.40) 

The expectation of the logarithm of A under G{A; a, b) is given by 

E G(A;ai6) 0nA) = V(a)+ln6 (2.41) 

where xfj dénotes the digamma function. For x > 0 the digamma function may be presented in its intégral form as 
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xj): R + \{0] -»1,xh xj}{x) = Ç (Ç - dt (2.42) 

Without proof, we note that the differential entropy of a random variable A governed by a Gamma probability 
density function in its "shape and scale" parameterization is given by 

Jf (G(A; a, b)) = a + ln b + ln T(a) + (1 - a)^(o) (2.43) 

where T and i/j dénote the gamma and digamma functions, respectively. Likewise, without proof (Penny, 2001), we 
note that the KL divergence between two Gamma probability density functions on a random variable A > 0, 
G(A; a q , bq) and G (A; a p , ft p ), is given by 

XL{G(A; a q ,b q ) \ | G(A; a p ,b p )^ 

= (a" - l)V^(a q ) - ln b" - a" - ln r(o") + ln V(aP) + aP ln b? - (a? - l)(^(a^) + ln b") + ^ 

(2.44) 

where Y dénotes the gamma, and xfj dénotes the digamma function. 

3 Wiener Processes, Stochastic Intégration, and Euler Maruyama Discretization 

The aim of this section is to review a minimum set of concepts from stochastic analysis which forms the 
background for the embedding of (discrete-time) LGSSMs in the context of (continuous-time) stochastic differential 
équations as discussed in Sections 2.1 and 5 of the tutorial. Supplément Section 3 is not meant to provide a 
comprehensive discussion of stochastic analysis, but rather to collect some pointers to aspects of stochastic analysis 
the interested reader might pursue. Additionally, we hope to provide some interprétation for équations (1) -(6) of 
Section 2.1, as well as équations (36) - (37) of the tutorial example in Section 5. 

3.1 Wiener processes 

In this section, the intuition and mathematical formulation of Wiener processes is briefly sketched. For a 
comprehensive review of stochastic processes in gênerai and Wiener processes in particular, the reader is referred 
to (Bass, 2011). In the context of the tutorial, Wiener processes form the mathematical model of random 
fluctuations and the latent level (réf. équation 1 of the tutorial), which, by means of Euler-Maruyama approximation, 
results in the Gaussian state space évolution of the LGSSM (réf. équation 2 of the tutorial). Here, we proceed by 
introducing the définition of gênerai stochastic processes, the physical intuition of Wiener processes, the définition 
of Wiener processes as stochastic processes, and finally some properties of Wiener processes required for the 
discussion of stochastic intégration. 

General Stochastic Processes 

Informally, a gênerai stochastic process can be regarded as a set of random variables (X t ) teT indexed by an 
uncountable index set rel + that models continuous time. Formally, a gênerai stochastic process is defined as 
follows: Let (fî, c/Z, P) be a probability space and T an arbitrary nonempty set. Let X t : (H, cÂ) -* (X t , ® t ) be a (c/Z- 
S t )- random variable for every t ET. Then 

* T :=(«,«*, P,(JQ te7 .) (3.1) 
dénotes a gênerai stochastic process with parameter set T. 
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The uncountability of the parameter set T entails some mathematical difficulties for the définition of the 
outcome set H of a stochastic process, its corresponding cr-field c/Z, and the existence of the probability measure P. 
Thèse difficulties can be resolved using Kolmogorov's theorem (Bass, 2011), which allows for proving the existence 
of a stochastic process based on a consistent set of finite marginal distributions. In other words, the définition of the 
Wiener processes based on normally distributed incréments given below can be related back to the gênerai 
définition of stochastic processes by means of Kolmogorov's theorem. Below, we often use the simplified notation 
Z(t) for a gênerai stochastic process. 

Physical Intuition of Wiener Processes 

It is helpful to consider the physical interprétation of a Wiener process to motivate the formai définition 
below. In its physical interprétation, a realization of a one-dimensional Wiener process can be regarded as the 
projection of the displacement vector of a three-dimensional particle in a fluid onto a selected coordinate axis of R 3 
over time. Specifically, X 0 = 0 would dénote the particle centered on its starting point and X t the projection of its 
displacement from the starting point at time t onto e.g., the x-axis. Importantly, the particle is considered large with 
respect to the molécules constituting the fluid. In principle, based on a set of initial conditions, the movement of ail 
molécules in the fluid is determined by Hamiltonian (Le., generalized Newtonian) dynamics. However, because the 
number of molécules is conceived as very large, a statistical or probabilistic view is appropriate. The molécules of the 
fluid are imagined to collide with the large particle at a time scale that is short with respect to the time scale that the 
particle is observed on. In this sensé, the movement of the large particle in a time interval [ti,t 2 ] can be imagined as 
being the resuit of the summation of a large number of independent fluid-molecule-particle collisions, each evoking 
a minute movement of the particle. Informally, the central limit theorem of probability theorem states that the sum 
(or average) of a high number of independently distributed random variables (here the minute motions of the 
particle evoked by each molecule-particle collision) is normally distributed. Thus, displacement in X, i.e. the 
différence X t2 — X ti can be considered to be normally distributed, and, because the molécule impacts are conceived 
as independent and evoking motion in ail spatial directions, have an expectation of zéro, i.e. E(X t2 — X t ^) = 0. In 
other words, most likely, due to the complementary nature of the summed impacts between t x and t 2 , the particle 
stays were it is. Further, if the characteristics of the fluid do not change over time (e.g. the fluid is not heated up, 
causing stronger impacts of the molécules), the distribution of the spatial distance in a time interval [t x , t 2 ] should 
be the same as the distribution of the spatial distance in a time interval [t x + h, t 2 + h], with h > 0. Finally, the 
particle motions evoked by the impact of fluid molécules are considered to be independent over disjoint time 
intervais. 

Définition of Wiener Processes 2 

The intuition of a Wiener process realization as a large particle colliding with many small fluid molécules at a 
short time scale with respect to the observation time scale of the movement of the large particle yields the following 
set of formai requirements for a mathematical model (fl, c/Z, P, (X t ) teT ) of Wiener processes: (fl, c/Z, P, (X t ) tET ) is a 
stochastic process with parameter set T = R + , for which the following requirements hold: 

(1) Initial value X 0 = 0 P-almost everywhere (i.e., if there exists N c fl for which X 0 (a) N ) =é 0, u) N £ N, then 
P(N) = 0. In other words, P{X 0 = 0} = P({a) £ n\X 0 ((û) = 1}) = 1). 

(2) Normality For every t > 0,X t is normally distributed with expectation 0 and variance parameter v(t). Here 
!?(■) dénotes the variance of the normal distribution of X t as a function of t. 

(3) Stationarity and Independence (fl, c/Z, P, (X t ) tET ) is a stochastic process with stationary incréments, i.e. the 
distribution Px t -x of the incréments X t — X s (s,t £ T,s < t) only dépends on the différence t — s, but not 
on t or 5 itself. In other words, for h > 0, we have Px t+h -x s+h = Px t -x s - Note that with X t and X s also the 
incrément X t — X s is a random variable. Furter, (fl, c/Z, P, (X t ) teT ) is a stochastic process with stochastically 



2 In the following section, |5| dénotes the cardinality of a set S, not a déterminant. 
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independent incréments, i.e. for ail / = {t 1( ...,t n } £ K(T) (where W(T) ~ {J\J cT,0<|/|< oo}) with 
n > 2 and t x < ••• < t n the random variables (incréments) X t2 — X tl , — ,X tn — X t are stochastically 
independent. 

Note that, formally, the mathematical framework of stochastic processes requires an existence proof of the 
postulated probability measure P on (fl, JV) , which can be achieved by showing the consistency of the set of 
marginal distributions defined above. We eschew this existence proof here and instead note some informai 
properties of Wiener processes required for the concept of stochastic intégration. 

Properties of Wiener processes 

We first reformulate the Wiener process without référence to the underlying measure space and, from now 
on, dénote the Wiener process by W(t), t £ [0, T] (Hassler, 2007). Informally, W(i) is a process with a starting value 
of 0, and independent, normally distributed, stationary incréments. More formally, we can restate (1) - (3) in the 
définition of the Wiener process above as follows 

(1) Initial value The initial value of the Wiener process is zéro with probability 1, i.e. P(W(0) = 0) = 1. 

(2) Normality The incréments WÇt-J - W(t 0 ), W(t n ) - W^^), where 0 < t 0 < t x < ••• < t n are 
independent for n £ M. 

(3) Stationarity and Independence The incréments W(t) — W(s) are normally distributed with expectation 0 
and variance equal to the temporal différence t — s for 0 < s < t, i.e. W(t) — W(s) ~ N(W(t) — 
W(s); 0, t — s). Notably, the variance of the incréments is stationary, i.e. it only dépends on the différence 
t — s, but not the absolute values of s and t, and the joint distribution of the incréments is multivariate 
normal with a diagonal covariance matrix. 

The Wiener process is defined by the properties of its incréments. Nevertheless, the Wiener process can, inituitively, 
be conceived as a stochastic function of the form W: [0, T] -» R, t i-> W(t) where W(t) is distributed according to a 
normal distribution N(W(t); 0, t). This is readily seen by considering the distribution of the incrément W(t) — 
W{0) for t £ [0, T]. According to (3) W(t) - W(0) ~ N(W(t) - W(0); 0, t - 0) and according to (1) W (0) = 0 
with probability 1. Based on the properties of normal distributions, we can thus infer that E(W(t)) = 0 and 
V(W(t)) = t. 

Gaussian Processes 

Equivalently, a Wiener process W(i) can be conceived as a Gaussian process. A Gaussian process is a 
stochastic process (X t ) tET with a countable index set T := {t 1( ...,t n }, t 1 <t 2 < ■■■ < t n , where |T| = n, n £ M, 

which is multivariate normally distributed. In terms of Gaussian processes, the vector W(t) •■= (W^t-^, ...,I/K(t n )) 
is distributed according to N(W(t); 0,2), where the entries of the covariance matrix I £ R nxn , > S are given by 
= min(t(,t 7 ) (for a proof, see e.g. (Hassler, 2007)). Note in particular, that the diagonal éléments of this 
covariance matrix are increasing and that the off-diagonal éléments are non-zero, i.e. modeling stochastic 
dependencies. 

3.2 Stochastic intégration 

Stochastic intégrais, unlike the intégrais known from standard calculus and analysis, are random variables, 
implying that the values they assume display some "randomness". The aim of this section is to introduce the Ito 
intégral of a stochastic process as the prerequisite for the framework of stochastic differential équations. To help 
with its introduction, the Ito intégral is developed as the endpoint of a "stochastification" of the Riemann intégral 
known from standard calculus. The development of this section is largely based on (Hassler, 2007), Sections 6 - 11. 

Convergence in the quadratic mean 
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In this section, we lay the foundations for the définitions of stochastic intégrais as the limits of their 
corresponding sum expressions, in analogy to the well-known quadrature expressions for Riemann intégrais in 
deterministic analysis. To this end, we require a convergence notion for random variables and discuss the supporting 
interval partition used in the subséquent sections. 

The convergence form required in the définition of stochastic intégrais below is "convergence in the 
quadratic mean". A séquence of random variables (X n ) neM is said to converge to a random variable X in the 

2 

quadratic mean, if E((Z n — X) 2 ) -* 0 for n -> oo , which will be denoted as X n -* X. Without proof, we state the 
following (Cauchy) convergence criterion: A séquence of random variables with E(X%) < oo converges in the 
quadratic mean, if for arbitrary n and m, E((X m — X n ) 2 ) -> 0 for m, n -> oo, or, equivalently, if for arbitrary n and 

m, E(X m X n ) -* c < oo for for m, n -* oo, c £ E. 

The concept of convergence in the quadratic mean can be demonstrated using a simple example: Consider 

the séquence (_X n ) nEM of random variables defined by X n •■= - J^ =1 Yi for n = 1,2, ... where the Y u i = 1, ...,n are 

independent, normally distributed random variables with mean 0 and variance a 2 , i.e. Y t ~ N(Y t ; 0, <r 2 ) for 
i = 1, ...,n. Using the Cauchy convergence criterion above, it can be readily shown that this séquence converges in 
the quadratic mean (assuming E(X%) < oo) as follows 

e(x M = e gif =1 Yi ■ ^t =1 n) = ^T=T m) m 2 ) = ° 2 - o 0.2) 

forn, m -» oo. in the above, the second équation follows, because the expectations of the products Y t • Yj for i =t 
j are 0, as the Y u i = 1,2, ... are assumed to be stochastically independent. 

Interval partitions 

In order to be able to define the intégral of a function on the interval [0, T] c R, the interval [0,T] is 
partitioned into disjoint subintervals of the form 

S n C[0, W ■= [t 0 , h] U [d, t 2 ] U ... U [t n _i, t n ] (3.3) 

where 0 = t 0 < t x < t 2 < ■■■ < t n _ x < t n = T. It is assumed that this partition becomes more and more fine 
grained as n increases, in other words: max 1 <j< n (t( — t^) -> 0 for n -» oo. A typical example of a partition that 

iT 

fulfills this criterion is the equidistant partition of [0, T], given by t t = — (i £ N n ), for which one has 

k-ti-i = i T --ti-V T - = l (ieN„) (3-4) 

T 

which obviously fulfills lim n ^ œ - = 0. We note that for any deterministic function /: [0, T] -> M, 

S?=i/(ti) - /(t t _0 = /(t x ) - /(t 0 ) + /(t 2 ) - /(t x ) + - + /(t n ) - /(t n _!) (3.5) 
= /(tn) - /(t 0 ) 

=f(T)-no) 

and thus, for the identity function, we have 

Y.Uh-t i - 1 = T-0 = T (3.6) 
An arbitrary point of support in the interval [t £ _ x , t £ ] will be denoted by t ( *. 
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The stochastic Riemann intégral 

Informally, stochastic Riemann intégrais are the usual Riemann intégrais with a stochastic process being part 
of the function integrated over. If the stochastic process in question is a Wiener process, the stochastic Riemann 
intégral is normally distributed with an expectation of zéro and a closed form équation for its variance can be 
obtained. Formally, the gênerai stochastic Riemann intégral over the product of a deterministic function / and a 
gênerai stochastic process X(t) is defined as the limit of the Riemann sum 

R n = ïummïxti-t i - 1 ) 0.7) 

If the limit of this sum convergences in the quadratic mean for n -* oo independently of the form of the partition 
S n ([0, T]), this limit is defined as the stochastic Riemann intégral: 

SZfCs)X(s) ds == lim n ^ œ ZUfitOXCtîXti - t t _i) (3.8) 

We next consider stochastic Riemann intégrais, for which the stochastic process is a Wiener process. For a 
deterministic function /, the stochastic Riemann intégral of the product between / and a Wiener process W(i) 
corresponds to a normal distribution with expectation 

E(j 0 T f(s)W(s)ds) = 0 (3.9) 

and covariance 

C(fimw(s)ds, fifCs)Wis)ds) = iïiïf(s)fCi)mm(s,i)dsdt (3.10) 

(fora proof of the above, see (Hassler, 2007)). 

The Riemann-Stieltjes intégral 

Riemann-Stieltjes intégrais are solutions for spécifie stochastic differential équations and are normally 
distributed random variables. The integrand of a Rieman-Stieltjes intégral is a deterministic function /, however, this 
is integrated with respect to a Wiener process, which defines the Riemann-Stieltjes sum as follows 

rs u = J2 = if(t;)(w(tù - w(u-i)) (3.ii) 

As for the stochastic Riemann intégral, if the limit of this sum for n -> oo exists in the quadratic mean sensé 
independently of the form of the partition S n ([0, T]), this limit is defined as the Riemann-Stieltjes intégral 

f*f(s) dW(s) == lim n _ J^ =i m)(W(tt) - WQti-J) (3.12) 

As a resuit of being the limit of a sum of normally distributed random variables the Riemann-Stieltjes intégral is 
normally distributed with expectation 

E(j 0 T f(s)dW(s)) = 0 (3.13) 

and covariance 

c(^f(s)dW(s),^f(s)dW(s)) = £f 2 (s)ds (3.14) 

(for a proof of the above, see (Hassler, 2007)). To see that for the Riemann-Stieltjes intégral of the constant unity 
function 
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ÛdW{s) = W{t 2 )-W{t 1 ) 



(3.15) 



we consider its relation to the concept of partial intégration as known from the calculus of standard Riemann 
intégrais. Recall that, for two deterministic integrable functions /: [0, T] -* M. and g: [0, T] -* M., the following rule of 
intégration holds 

Sof(t)g'(t)dt = f(t)g(t)\ T 0 - fif'(t)g(t)dt (3.16) 

For the Riemann-Stieltjes intégral, one has (changing the upper intégral bound from T to t £ [0, T], and the variable 
of intégration from t to s, for a proof see (Hassler, 2007)) 

fimdWis) = /(s)W(s)|S - J 0 V(s)d/(s) (3-17) 

which can be reformulated as 

ftns)dW(s) = f{t)W{t) - ^W(s)f'(s)ds (3.18) 

Finally, we consider the spécial case of the Riemann-Stieltjes intégral of the unity function /(t) := 1. Substitution 
into the équation above yields 

/„' dW{s) = W(f) - W(0) - £ W(s) -0ds = W(f) (3.19) 
More generally, on the interval [t 1( t 2 ] c [0, T], one thus has 

£ dW(s) = W&f? - £ W{s) -Qds = W(t 2 ) - WitJ (3.20) 

The Ito intégral 

The Ito intégral provides the gênerai basis for stochastic differential équations as discussed below. Here, we 
briefly discuss its définition and some of its properties. The gênerai Ito intégral of a stochastic process Z(t) with 
respect to a Wiener process W(i) is defined as limit of the Ito sum 

In ■■= E?=i*(t £ -i)0nt £ ) - Wft.O) (3.21) 

where, importantly (and in contrast to other stochastic intégrais, as e.g., Stratonivich intégral) the point tj_ 1( Le., 
the left border of the partition interval [tf_i, t;] serves as supporting point. If X(i) is a stochastic process with finite 
variance, and if X(t) is independent of the future of the stochastic process, the Ito sum converges to a well-defined 
limit in the quadratic means sensé and is defined as the Ito intégral 

£x(t)dW(t) == lim n ^If =1 ^(t£_ 1 )((^(t £ ) - Wfa-i)) (3-22) 

The Ito intégral is a random variable and its first two central moments can be shown to be given by (for a proof see 
(Hassler, 2007)) 

E^X(t)dW(t)) = 0 (3.23) 

and 

C (/ 0 T X(t) dW(t),fix(t) dWQtj) = J 0 T E(Z 2 (t)) dt (3.24) 

In the degenerate case that the stochastic process Z(t) corresponds to the unity function, Le., that P(X(t) = 1) = 
1, the Ito intégral évaluâtes to the Riemann-Stieltjes intégral discussed in the previous section, which in turn 
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évaluâtes to the incrément of a Wiener process (as seen above, again changing the upper intégral bound from T to 
t e [0, T], and the variable of intégration from t to s): 

ftxis) dW(s) = f< 1 dW(s) = W(t) (3.25) 

3.3 Stochastic Differential Equations 

Diffusions and Stochastic Differential Equations (SDEs) 

A diffusion is a stochastic process comprising the sum of a stochastic Riemann intégral and an Ito intégral in 
the form 

X(t)=X(0) + S*ix(.X(sls)ds + S t 0 o-(X(sls)dW(s) (3.26) 

where X(0) indicates the starting point of the stochastic process X. Here, the functions fi and a are referred to as a 
drift function and volatility functions, respectively, and are functions of time and the diffusion process itself. The first 
intégral term, a stochastic Riemann intégral, models a deterministic drift of the process X, while the second intégral 
term, an Ito intégral, models a random component. Taking the derivative with respect to time on both sides of the 
diffusion équation above yields its differential form 

±X(t) = £ (X(0) + fin(XCs), s)ds + ï<<T(.X(.s), s) dW(.sj) (3.27) 

which is usually denoted as 

dX(t) = fi{X(f), t)dt + o(X(t), t)dW(t) (3.28) 

Informally, the équation above states, that the change of the process X within an "infinitésimal" small time 
interval at time point t is given as the sum of (1) the product of a deterministic function fi (depending on the state of 
the process X at time point t and time itself) with the "infinitésimal" length of the time interval dt and (2) the 
product of a deterministic function a (also depending on the state of the process X at time t and time itself) with the 
"infinitésimal" small incrément of a Wiener process at time t, which is randomly distributed. The classification of 
stochastic differential équations (SDEs), as well as the study of existence and uniqueness properties of their 
solutions, forms an intégral part of stochastic analysis. The interested reader is referred to (Kloeden & Platen, 1999) 
for an in depth analytical and numerical treatment of SDEs. Here, we briefly introduce the set of narrow-sense linear 
SDEs with the aim of introducing the Langevin équation as provided in équation (4). 

Autonomous narrow-sense linear SDEs and the Langevin équation 

An SDE of the type 

dX(t) = n(X(t), t)dt + o(X(t), t)dW{t) (3.29) 

is called a linear SDE, if the functions |i:lx [0, T] -> M and a:Rx [0, T] -> R take the form 

fi(X(t), t) == Mi(t)*(t) + n 2 (t) and a{X{t), i) ~ a^XÇt) + a 2 (t) (3.30) 

for coefficients \i\,\i 2 > °i> °2 : lA T] -* M. If ail coefficients are constants, the linear SDE is called autonomous. 
Further, if [i 2 (t) = c 2 (0 := 0, the linear SDE is referred to as homogenous. Finally, the linear SDE is referred to as 
linear in the narrow sensé, if o"i(t) == 0, or in other words, the noise term is additive. The autonomous narrow-sense 
linear SDE thus has the gênerai form 

dX(t) = QiiXÇt) + pL 2 )dt + a 2 dW(t) (3.31) 
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Setting \i x •■= a,\i 2 •■= 0 and a 2 ■= a, one obtains the "latent linear diffusion process" of équation (4) in the tutorial, 
commonly known as the Langevin équation: 

dX(t) = aX(t)dt + odW(t) (a < 0, er > 0, t £ [0, s]) (3.32) 

The gênerai solution of the Langevin équation is given by (for a proof see (Hassler, 2007)) 

X(t) = exp(at)Z(0) + exp(at)Z(0) /„' a exp(s) dW{s) (3.33) 
which spécifies a homogenous Gaussian process with expectation 

E(X(t)) = exp(at) E(Z(0)) (3.34) 

and variance 

V{X{tj) = exp(at) V(Z(0)) + ^ (1 - exp(at)) (3.35) 

3.4 Euler-Maruyama discretization 

A commonly employed discretization technique for SDEs is the Euler-Maruyama method (Hassler, 2007; 
Kloeden & Platen, 1999), which defines a recursive scheme for the numerical évaluation of SDEs on a time interval 
[0, s] c R. The Euler-Maruyama method is based at a set of discrète time points t t £ [0, s], i = 0,1, ... , n with 

0 = t 0 < t x = ^ < ••• < û n _! <t n = s 

which representthe discretization of the observation time interval 

[0,s]=U? =1 [^s,is[ (3.36) 

into n equisized bins of length At := ^ , and which in the context of the tutorial may be conceived as being provided 

by the sampling rate of the measurement or observation process. The first step towards the Euler-Maruyama 
approximation of (3.32) to rewrite it in its intégral form given by 

X(t) = Z(0) + / 0 f aX{r)dT + adWiz) t £ [0, s] c R (3.37) 

In (3.37) the first intégral term dénotes a stochastic Riemann intégral and the second term dénotes a Riemann- 
Stieltjes intégral. Considering this intégral form on the subinterval [t;_i, t; [ c [0, s] (i = 1, ... , n) then yields 

Xfa) = Xit^) + aX(r)dT + adW{j) (3.38) 

By approximating the value of X{s) for s £ [t £ _ x , t £ ] by its left endpoint value, that is, by setting X{s) •■= X{ti_{) for 
ail s £ [ti_i, t £ ], one obtains the Euler-Maruyama approximation of the stochastic differential équation 

X{t t ) = jr(t t _!) + aXit^dz + £' adW{r) (3.39) 

L l-1 L l-1 

= Z(t £ _ x ) + aXit^) dr + a dWQr) (3.40) 

for i = 1, ... , n. Riemann intégration of the first intégral term yields 

f fi dr = T\ t t i = t, - t;_ x = - = At (3.41) 

and Rieman-Stieltjes intégration of the second intégral yields 
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kL dw(j) = witù ~ w(tt ~ i) (3A2) 

From the définition of Wiener processes we note that that 

W(U) - WiU^) = W(^)-W = W(àt) (3.45) 

is normally distributed with expectation parameter 0 and variance parameter At.To rewrite 

dX(i) = aX(t)dt + odW(t) (3.46) 
in the familiar autoregressive process of order 1 (AR(1)) form 

x t = ax t _ x + s t , e t ~N(e, ; 0, a 2 ) (3.47) 

we thus start by considering 

X(t t ) = Z(t£_ x ) + a^(t £ _ 1 )At + a(W(ti) - Wfa-à) (3.48) 

and change the notation by setting 

x t . == X(ti), w tt : = W(t0 and e t . ~ w t . - w t ._! (3.49) 

resulting in 

x t . = x ti _i + ax tl i At + as t . where s ti ~N(e ti ; 0, At) (3.50) 
Nextwe replace the discrète real time points 0 = t 0 , ... , t n = s by integer indices t = 1,2 T ■■= n + 1 resulting in 

x t = x t-i + ocx t _i^t + as t where e t ~N(e t ; 0, At) (3.51) 
Using ¥(aX) = a 2 ¥(X), we can rewrite the above as 

x t = (1 + aàt)x t _ 1 + s t where e t ~N(e t ; 0, a 2 At) (3.52) 

Finally, we define 

a := (1 + aAt) and o 2 ~ o 2 At (3.53) 

and obtain the AR(1) form 

x t = ax t _ x + £ t where e t ~N(e t ; 0,a 2 ) (3.54) 
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4 An Introduction to Variational Calculus 

The basic problem of variational calculus is the optimization, i.e. minimization or maximization, of 
functionals with respect to their arguments (informally, a functional is a "function of a function". It takes a function 
as its argument and maps it onto a real-valued scalar). In other words, given a functional T defined on a function 
space V, the aim of variational calculus is to find an élément y of V for which T assumes an extremal value. This 
basic problem can be understood in analogy to the maximization/minimization of a function defined on a standard 
vector space, such as M or IR n . In fact, the basic approach of solving the variational optimization problem shows 
many intuitive similarities with approaches from ordinary calculus (Figure 3, panels A and B). In ordinary calculus, the 
argument x that maximizes a given function / is usually found by first identifying the stationary points of /, i.e. those 
arguments x, for which the derivative or gradient of / vanishes. This is the necessary condition for an extremal 
value: If / has a maximum or a minimum in x, then its gradient in x is equal to zéro. However, the condition of a 
vanishing gradient is not sufficient for a (local) extremal value: The gradient may also vanish for an inflection point or 
for constant parts of /. Similar considérations apply for the calculus of variations. The formai treatment of necessary 
and sufficient conditions of functionals is a central topic of functional analysis, just as real analysis represents a 
formai treatment of ordinary calculus. In the current tutorial, functional analysis is eschewed and only the necessary 
condition for extremals of functionals will be considered. This approach is somewhat justified by the applied 
character of this tutorial and the intuition that the functionals considered are usually convex . For a convex 
functional the necessary condition for an extremum is also sufficient for a global minimum as it is for convex 
funvtions (Boyd & Vandenberghe, 2004). The basic problem of variational calculus may thus be formalized as 
follows: Let V be a vector space of functions, for example, the space of ail real-valued differentiable functions on the 
real line Y •■= {y £ Further, let T be a real-valued functional on Y, i.e., let T be a mapping that takes an 

élément y of Y as input and maps it onto the real line, formally T: Y -> M. Then, the basic problem of variational 
calculus may be stated as finding y* e Y for which T assumes an extremum (a minimum or a maximum): 

y* := argmin yel , T{y) = argmax y6l ,(-J'(y)) (4.1) 

A necessary condition for such an extremal argument (and, by appealing to the intuition from ordinary 
calculus, an équation of the type /'(x) = 0 which can be solved for the respective argument x) can be established 
by introducing the so-called "Gâteaux derivative" of :F:Let y* be a minimum of T and let v £ Y be another function 
in Y. Because Y is a function space, y* + sv for s £ R is also an élément of Y. Moreover, because y* is a minimum of 

7 

7{y* + sv) > T(y*) (4.2) 
This considération allows for defining a function of s as follows: 

h y * iV : U^U, s» h y * iV {£) := T{y* + sv) (4.3) 

By définition, this function has a minimum in s = 0 as T{y* + 0 • v) = T{y*) < T{y* + sv). Thus, if h y * v is 
differentiable, its derivative with respect to s vanishes in s = 0, in other words, 

Y e V,*0)l £ =o = lim E ^ 0 ^ ^ L2 = 0 (4.4) 

This derivative of h y * v in s = 0 (if defined) is referred to as the Gâteaux derivative of T and is usually denoted as 
ST(y*, v). One thus obtains 

ÔT(y*,v) == \ïm^ 0 nr+£V ^ nr) = 0 « £ h y%v {s)\ E=0 = 0 (4.5) 
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A necessary condition for an extremal value of the functional T can thus be expressed as follows: Let y* be an 
extremal value o\T, Le., y* ■= argmin y6l , T(y). Then it follows that 

ST(y*,v) = 0 (4.6) 

for ail v £ Y given that ST(y*,v) exists. This condition thus provides an approach for determining an extremal 
argument of a functional: after determining 8T(y, v) for a given functional T, set 8T(y, v) = 0 and solve for y. This 
approach yields the "stationary point(s)" of T , which can then be investigated with respect to their extremal 
properties. Further, in the case of a convex functional T, thèse solutions already constitute the minimum 
argument(s). 

4.1 The Variational Problem with Fixed Endpoints 

Above, a necessary condition for an extremal argument y* of a functional T was introduced within a very 
gênerai context. Here, this gênerai approach is made concrète for a spécifie class of functionals (subsuming, in a 
sensé, the variational free energy considered in Section 3 of the main tutorial) and is demonstrated by means of an 
example. Specifically, by limiting the function space to real-valued functions on an interval [x 0 , x x ] and specifying the 
class of functionals as the intégrais of thèse functions, one arrives at the Euler-Lagrange équation for functional 
optimization. As will be shown below, the Euler-Lagrange équation defines the extremal argument y* of a functional 
T by means of a second-order partial differential équation, which can be solved using methods from ordinary 
calculus. 

The variational fixed endpoint problem may be stated as follows: let (x 0 ,y 0 ), (x^y-J be éléments of M 2 and 
Y the vector space of ail twice-differentiable functions y on (x 0 , x x ) which fulfill the endpoint conditions y(x 0 ) = y 0 
andy(x x ) = y x , Le., 

Y == {y £ C 2 [x 0 , x x ] |y(x 0 ) = y 0 , y(x x ) = y x } (4.7) 
Further, let L be a twice partially differentiable, real-valued function defined on the set R x M x [x^x-J: 

L: 1 X 1 X [x 0 , x x ] M (4.8) 
Then, the variational fixed endpoint problem consist of finding a y* £ Y such that the functional J'defined by 

T:V ^ ffi,yH?(y) := f* 1 L(y(x), y'(x), x) dx (4.9) 

becomes extremal, in other words, finding 

y* := arg mm yev Q L(y, y', x) dx (4.10) 

Note that, y and y' refer to both the functions y, y' EV and the real coordinates y(x),y'(x) £ M of the function L 
for x £ M. Specifying the Gâteaux derivative of T as in (4.5) yields the following necessary condition for an extremal 
value in the current fixed endpoint case 

ÔT(y*,v) = ^QKy* + ev,y*' + ev',x) dx\ £=0 = 0 (4.11) 

with v £ V where, to respect the boundary conditions, 

V == {v £ (^[xcXilKxo) = 0,v(xi) = 0} 
so that y + v £ Y for arbitrary y and v. 
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It can be shown that the derivative of the intégral in (4.11) may be reformulated such that the necessary condition 
above may equivalently be expressed as (see below for détails; for simplicity of notation, the asterisk denoting the 
extremal argument is suppressed in the following équations, which also emphasizes the notion of y being a variable 
which is solved for) 

ST(y,v) = 0 o ^l{y,y',x)-^L{y,y',x) = 0 (4.12) 

The équation on the right-hand side of the expression above is referred to as the Euler-Lagrange équation and 
represents an (implicit) second-order partial differential équations for y. This can be seen by evaluating the total 
differential of its second term resulting in (see below for détails) 

è L(y ' y, ' x) = ^ L(y ' y, ' x) ° h L{y ' y ' ,x) = ^b L(y ' y, ' x)y ' + â£p L (y>y'> x w + ^y-, L (y>y'> x "> 

(4.13) 

This implicit partial differential équation for y may be made explicit by dividing the second partial derivative of the 
Lagrange function with respect to y' and evaluated using standard methods from ordinary calculus. The resulting 
y E V is a stationary point of T and a candidate for an extremal argument. Again, if T is convex, it is also already 
known to be an extremal argument. In the following example, the calculus of partial differential équations is 
eschewed, as the functional considered is an explicit function of y' only. 

o Dérivation of équations (4.12) and (4.13) 

In this section, we dérive the Euler-Lagrange équation 

^L(y,y',x)-^L(y,y',x) = 0 (4.14) 
based on the necessary condition for a minimum of 

T(y) == 4 Xl L(y(x),y'(x),x) dx (4.15) 

given by 

ST(y,v) = ^QKy + sv,y' + ev'.x) dx\ E=0 = 0 (4.16) 

To this end, we first note the définition of the total differential for a function /: M 3 -> R and y := (y,y',x) T ,v •■= 
(v, v' , 0) T , e £ R given by 

+ ev)\ E=0 = Ef=i^/(y) • v t (4.17) 
Because the Lagrange function L is differentiable by définition, we thus obtain 

TeQ 1 ^ + £V,y ' + £V '' X ^ dx l £=0 = Qj~ £ ( L ( y + £V,y ' + £V '' x ^\s=o dx 

= Q^L{y,y',x) ■ v + -^jL(y,y',x) ■ v' +jpL(y,y',x) ■ 0 dx 



= C è L(y ' y '- x)v dx + C h L{y> y '- x)v ' dx 

(4.18) 
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The second intégral term is next integrated using partial intégration. In gênerai we have 

Saf'g = fg\ b a -£fg' (4.19) 

Here, we have 

a = x ll b = x lt g = ^jL, g' = ^jy 1 ' f = v, f = v' (4.20) 

and thus obtain 

j Xl ^L ■v'dx = ^ 7 L- vfr -f Xl v- -^L (4.21) 
J x 0 dy' dy' ,x o J x 0 dx dy' v ' 

Written in full, the above yields 

= g^L(yix),y'Cx),x)vCx)\xl - Q^^7) L (y{x),y'{x),x)v{x)dx 

= ^^Ky(*i),y'(*i),*iM*i) - ^r^L(y(x 0 ),y'(x 0 ),x 0 )v(x 0 ) - Q-^^t^L (y(x),y'(x),x)v(x)dx 

(4.22) 

As we are considering only v for which v(x 0 ) = v(x-l) = 0, we have 

Q-^ L (y(. x )>y'(. x )> x >' dx = -Ç£^MyW,y'(x),xMx)dx (4.23) 

Substitution into the équation above results in 

ÔT(y, v) = 0 

° Giy\x~) L (yW'y'W' x ^ v w dx ~ Qd\dy^) L (yw.y'w.^^w^ = o 

<=> Q(^L(y(x),y'(x),x)-^^^L(y(x),y'(x),x)^v(x) dx = 0 

(4.24) 

We now make use of the following Lagrange-Lemma, which we state without proof: 
Lapranpe-Lemma 

Let f be a continuous real-valued function on the interval [x 0 , x x ] and let 

Qf(x)v(x) dx = 0 

for ail continuously differentiable, real-valued functions v that fulfill v(x 0 ) = v{x{) = 0. Then it follows that 
/(x) = 0 on [xq.Xi]. 

With this Lemma, we thus obtain the Euler-Lagrange équation 

ST(y,v) = 0 « ^L(y(x),y'(x),x) - ±^L(y (x),y'(x),x) = 0 (4.25) 
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Finally, we note that the Euler-Lagrange équation constitutes a second order partial differential équation for the 
function y, because writing out the differential of the second term on its left hand side yields 

(4.26) 

Written in fui I, the Euler-Lagrange équation in its gênerai form thus corresponds to 

^- )L (y(x),y'(x),x) - gy J 2 gyrM L(y(x),y'(x),x)y'(x) - ^|^ L(yM,y'(4%''W - ^g^L(y(*),y'(*), *) = 0 

(4.27) 

□ 

Example: The Shortest Distance Between Two Points in R 2 

As an example, we consider the following problem: Let (a, a), {b, /?) £ R 2 , a < b. The vector space Y may be 
defined as 

Y := {y G e^a.b^yid) = a,y{b) = /3] (4.28) 
and the aim is to find a y G V for which the functional 

7: Y -» R,y -> y (y) := J* + y'(x) 2 dx (4.29) 

becomes minimal. The spécifie form of the functional in this example is motivated by the fact that, for parameterized 
curves in M. 2 (Le., functions y of the type /: M -> R 2 ), the arc length is defined as 

Z: V -» E, Z h» /(/) := ; a 6 ||/'W|| 2 dx (4.30) 

and the function y considered here represents the "function form" of such a curve, Le., f(x) = (x, y (x)) for which 

/'(x) évaluâtes to (l,y'(x)) . Therefore, in the spécial case of the example considered here, the Lagrange function 
is given by 

L: R -» R,y' » L(y') := -Jl + y' 2 (4.31) 

L is thus not explicitly dépendent on y and x, the partial derivatives of the Lagrange function with respect to y and 
x evaluate to zéro, and the Euler-Lagrange équation simplifies to 

S-L(y') = —4iL(y')^—4iL(y') = 0 (4.32) 
dy ^ J dxdy' ^ J dx dy' Ky J v ' 

The right-hand side of the équivalence relation (4.32) states that the total derivative of -^jL(y') is zéro and the 
function ^L(y') thus constant. One hence obtains an ordinary differential équation for y given by 



±L(y') = c ^y'(x) = J I ^ (4.33) 

for a given constant c G R (see below for détails). As the derivative of y is thus constant, y must be a linear-affine 
function of the type 

y(x) = sx + r,s,r G R (4.34) 
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Based on the fixed endpoints y(a) = a,y(b) = fi, the constants s and r may be evaluated, resulting in the solution 

y*(t) := argmin yev , f*JÏ + y'(x) 2 dx = ^(x-d) + a (4.35) 

The application of the variational method hence vérifies the intuition that the functional form of the curve 
representing the shortest distance between two points in the plane is a straight line (Figure 2 panels C and D). 

o Dérivation of équation (4.33) 

We first evaluate the partial derivative ^L(y') for L(y') := ^Jl + y' 2 : 

£W) = £ (1 + y-)\ = \ (1 + y-y\ ■ 2 ■ y = -JL= (4.36) 

The Euler Lagrange Equation 
thus implies that 



= c (4.38) 



Vi+y' : 

for a real constant c £ M. Reformulating then leads to the insight that the derivative of y w.r.t. x is constant: 

y' 2 = c 2 (l + y' 2 ) <^> y' 2 = c 2 + c 2 y' 2 <^> y' 2 - c 2 y' 2 = c 2 <^> y' 2 (l - c 2 ) = c 2 <^> y' = 
The Isoperimetric Variational Problem 



(4.39) 



In addition to the conditions of fixed endpoints, the extremal function maximizing/minimizing a given 
functional in the context of isoperimetric problems have to fulfill additional intégral conditions. In gênerai, the 
isoperimetric variational problem may be stated as follows: let (x 0 ,y 0 ), (xi,yi) be éléments of M 2 and Y the vector 
space of ail twice-differentiable functions y on (xq^) which fulfill the endpoint conditions y(x 0 ) = y 0 and 
y(xi) = y±. Further, let L and G be twice partially differentiable, real-valued functions defined on the set 
R x R x [x 0 ,x x ]. Then, the isoperimetric variational problem comprises finding a y* 6 Y such that the functional T 
defined by 

7:Y -» ffi,yHf(y) := f* 1 L(y(x), y'(x), x) dx (4.40) 

becomes extremal, whilst fulfilling the intégral side-condition 

I:Y -» M, y ^ /(y) := f* 1 G(y(x), y'(x), x) dx = 0 (4.41) 

The formai development of an approach for the solution of the isoperimetric variational problem proving the 
existence of Lagrangian multipliers falls into the realm of functional analysis and is eschewed here. Nevertheless, 
assuming the existence of both an extremal argument y* and suitable Lagrangian multipliers A, the Lagrange Lemma 
introduced in below provides the following recipe for solving an isoperimetric variational problem: 

(1) consider the Lagrange function F: R x R x [x 0 , xj -> R, defined by 

F(y(x),y'(x),x) := L(y(x),y'(x),x) + AG(y(x),y'(x),x) (4.42) 
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(2) instead of minimizing T , minimize the extended functional 

f := C F (yW-y'M- 1 ) dx (4 - 43) 

using the fixed endpoint approach developed in Section 3.2 to obtain the extremal argument y*(usually by 
means of the corresponding Euler-Lagrange équations) and 

(3) détermine A so that y* solves the side-condition J Xl G(y(x),y'(x),x) dx = 0. 

If this approach is possible, the Lagrange Lemma guarantees that the resulting y* fulfills the isoperimetric side- 
condition while minimizing the initial functional. This approach is demonstrated using a classic example below. 

It should be noted that the functions maximizing the functional of interest of the VML and VB frameworks 
are usually defined on the entire real line, rather than on a real interval with fixed endpoint conditions, as 
considered here. For simplicity, however, a formai treatment of the ensuing improper intégrais is eschewed, and the 
methods developed in the current section are transferred to the problems considered in the next section by the 
intuition that the probability density functions of interest are usually Gaussian for which /(x) -> 0 in the limits of 
x -» oo and x -» — oo. 

The Lagrange Lemma for the isoperimetric variational problem 

If the existence of a minimum y and of the corresponding Lagrangian multiplier A\s postulated, the following 
Lemma can readily be proven, and provides an ansatz for the solution of the isoperimetic variational problem: 

Laqranqe-Lemma for the isoperimetric variational problem 

Let M be an arbitrary set and J: M -* Mo function to be minimized on the restriction set 

R:={yE M\g{y) = 0} 

where g:M^>R defines a side condition. Further, letAe R and y £ R such that y is a minimum of 

f :=] +Ag : M -» M 

Then y is a Minimum of] on R. 

The Lagrange Lemma for the isoperimetric variational problem can be verified by considering the following: 

J(y) = J(9) + 0 = /(y) + Ag(y) < /(y) + Ag(y) = /(y) (4.44) 
Thus, if the problem of minimizing the extended functional 

f := 4* 1 W*)<y'<xu) dx ( 4 - 46 ) 

where 

F(y(x),y'(x),x) := L(y(x),y'(x),x) + AG(y(x),y'(x),x) (4.47) 

as specified in équations (4.40) and (4.41) can be solved, its solution provides a minimum of T which fulfills the side 
condition specified by /. 
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Example: The Problem of Dido 

The problem of Dido, to fence in a maximum area with a boundary of given length, may be stated as an 
isoperimetric variational problem as follows: Consider the vector space 

Y := {y £ C 2 [a, b] \y(a) = 0 = y (6)} (4.48) 

The aim is to find a y £ Y, which minimizes the functional 

T: Y R, y h> T(y) ~ - ^ y (x) dx (4.49) 

subject to the constraint (side-condition) that 

g (y) = 0, where g: Y -» R,y h> g{y) ~ f^JÏ + y'(x) 2 dx - l (4.50) 

for some constant l £ R. Intuitively, this problem may be stated as maximizing the area between the graph of y and 
the x-axis using a y of a given constant arc length l (Figure 2, panels C and D). Applying the recipe for the solution of 
the isoperimetric variational problem discussed above, one obtains the following extended functional (see below for 
détails): 

f := - £ y (x) dx + A (/* Vl+y'W 2 dx-l) = f a (-y(x) + A^Ï+yW ~ ^) dx (4.51) 

where the Lagrange function is not explicitly dépendent on x and is given by 

F:lxE, (y, y') » L{y, y') = -y + Ajl+y' 2 - ^ (4.52) 

For a Lagrange function that explicitly only dépends on y and y', the Euler-Lagrange équation simplifies to (see 
below for détails) 

p(y,y')--^iKy,y')y' = b (4.53) 

where B £ R is a constant. Evaluation of the Euler-Lagrange équation for the problem of Dido results in the 
following first-order ordinary differential équation for y (see below for détails): 

A /ï+7 F (-y-^-s)+A = 0 (4.54) 

A solution for this differential équation is given by 

y(x) = C + V 'A 2 - (x - D) 2 (4.55) 

where C := —B — and D £ R is an additional intégration constant (for a dérivation of this solution, see (Richter, 
Mathias, o. J.)). Taking the square of the équation above then results in 

(y(x) - C) 2 + (x - D) 2 = A 2 (4.56) 

which is recognized as the functional équation for a circle with the center at (D,C) £ R 2 and radius A. Thus the 
solution to the problem of Dido is a circular arc, the exact form of which dépends on the relationship between the 

endpoints a, b and the given arc length l (Figure 2, panels C and D). Considération of the endpoints yields D = 

while analysis of the side-condition allows for inferring A and subsequently C which can lie above or below the 
abscissa depending on the parameter. 
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o Dérivation of équation (4.53) 

d 

Provided that L is not an explicit function of x, and thus — L = dy , dx L = 0 the Euler-Lagrange équation (4.25) yields 

A/, _ ——i - o 

dy dx dy' 
y \dy dxdy' ) 

~y'fy L -(y'i^ L+ y' 2 êy L+ y' y" & 1 ) = 0 

-(^^'^ + /^L)-( y "^L + /^L + y ^L + y V^L) = 0 

^ d ^ a ( ' ^ l \ ~ 0 

dx dx V dy' / 

°è( l -/57 l ) = ° 



The refore 



(4.57) 



L(y,y')--^jL(y,y')y' = B (4.58) 



o Dérivation of équation (4.54) 

We apply 



p{y,y')-^F{y,y')y' = b (4.60) 



to the Lagrange function given in (4.52) 

F (y, y') : =-y + À J 1+ y2 _JL (461) 

and obtain 

-y + A7îT7 5 "-^-/^(-y + A7îT7 T -^;) = B (4.62) 

Multiplying both sides by + y' 2 yields 

-yVl+y' 2 + A(l + y' 2 ) - ^V 1+ y' 2 - Ay' 2 = SVl+y' 2 (4.63) 
<=> -yVl + y' 2 + A + Ay' 2 - +y' 2 - Ay' 2 - By/l + y' 2 = 0 

^^Ty T (-y-^-B) + A = 0 
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A B 




Isoperimetric Problem 




Figure 2. An intuitive notion of the calculus of variations. Figure 2A depicts a standard calculus analogy to the 
extremal problem of variational calculus. To find the extremal points x* from a given real-valued function /, 
standard calculus usually proceeds by finding its stationary points. The stationary points of a function are those 

arguments of the function for which the derivative j-f of the function vanishes, Le., for which j-f(x)\ x=x * = 0 
holds. Figure 2B depicts a conceptual overview of the variational calculus framework: central to variational calculus 
is the problem of identifying a (real-valued) function / in a function space V, which extremizes a given functional T. 
The functionals considered in this review (especially the variational free energy) map functions (in the context of the 
tutorial probability density functions) onto the real line. Candidate functions are found by capitalizing on the 
necessary condition for an extremal, Le., a vanishing Gâteaux derivative #F(/(;t))|^( X )=/*(:0 = 0 anc ' solving f° r 
f*(x). Figures 2C and 2D sketch the variational problems and the solutions to them considered as examples in 
Section 3. The fixed endpoint problem can be conceived as the problem to find the curve (Le., a mapping /: M -» M. 71 , 
here, n = 2) of minimal length Connecting two given points in M. 2 . The application of the variational calculus method 
allows for identifying the straight line Connecting both points as the optimal function (curve). The isoperimetric 
problem (here, with fixed endpoints) represents a variational problem with additional intégral side-conditions. The 
traditional example for this is the problem of Dido, which lies in finding the function of given arc length that 
maximizes the area between the function and the ordinate. Using the calculus of variations and a Lagrange multiplier 
approach, it can be shown that the solutions are circular arcs whose exact form dépends on the predetermined arc 
length. Maximization of the variational free energy in the VB framework may also be viewed as an isoperimetric 
problem. Here, the functions maximizing the variational free energy need to fulfill the intégral side-condition of 
being probability density functions on the real line, Le., their intégral over the real line should equal 1. 
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Dérivation of the VB inference theorem for mean-field approximation 

In the VB framework the aim is to approximate the log marginal likelihood ln p(y) by iteratively maximizing 
its lower bound T (<7(i9 s ), <7(i9\ s )) with respect to the arguments <7(i9 s ) and q(i9\ s ). For itérations i = 1,2, during 
the first maximization of finding q^ l+1 ^{~ds), q^{p\s) is treated as a constant, while during the second maximization 
of finding <7^ +1 - ) (t9\ s ), q^ l+v> {~ds) is treated as a constant. However, as i9 s and i9\ s may be used interchangeably, we 
here concern ourselves only with the case of maximizing T (q(i9 s ), q(i\s)) with respect to q(i9 s ). To obtain an 
expression for q( l+1 \d s ), we thus consider the functional 

T q»(A s )) = If <ZG? S )<7 (0 OV) 1" d ^\s (4.65) 

with the constraint 

J q(i9 s )di9 s = 1 <=> J q(t9 s )di9 s - 1 = 0 <=> J q(t9 s ) - <5(t9 s )dt9 s = 0 (4.66) 
where S dénotes the Dirac delta function which intégrâtes to 1. 
The Lagrange function is given in this case as 

F (qGU q«0v)) = î <7G9 S )<7 (0 6V) ln ( ^g^) ) d,9 V + M**) - (4.67) 

and, as in the previous section, the Gâteaux derivative ST (q(x), q^ÇO)^ is given by the derivative of F with respect 
to q(i9 s ), as F is not a function of <7'(i9 s ) . One thus obtains with a constant C £ R (see below for détails) 

Sf (q(iU<? (0 OV)) = Jq (0 (As)lnp(y,i9) dt9 Xs - lnq(t9 s ) + C (4.68) 
o Dérivation of équation (4.68) 

8î(q{d s ),qV{d Xs )) 

= âifc / q (i) (' 9 \ S ) ln P(y^) d*\ s - / <7G9s)<7 (0 6V) InqCfl,) d^ s - / <KtfsV°(Xs)ln<7 (i) 6V) dV\ s + Aqfo) - XS(d s )) 

= Jq®(tV)lnp(y,t9)dt9 Xs - lnqO? s ) - 1 - JV° (t9\ s ) ln g® (tf\ s ) dti\ s + À 
= jV°(t9\ s ) ln p(y, 0) dt9\ s - ln q(t9 s ) + C 

(4.69) 
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Setting the Gâteaux derivative (4.68) to zéro thus yields 

ln^ +1 )(,9 s ) == Jq» (t9 Xs ) ln p(y,û)d^ + C (4.70) 

Taking the exponential and expressing the multiplicative constant as proportionality leads to the "VB inference 
theorem for mean-field approximations" 

^ +1 )(i9 s ) « exp(Jq(tV)lnp(y,tf)diV) (4.71) 
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5 Détails of the univariate Gaussian example 
5.1 Introduction 

To demonstrate the VB inference method for i.i.d. data we consider the Bayesian estimation of the 
expectation and variance parameters of a univariate Gaussian (discussed in détail in (Michael Chappell, Adrian 
Groves and Mark Woolrich, 2008; W. D. Penny, 2000) and visualized in Figure 5 of the tutorial). We thus assume that 
the data is generated by a univariate Gaussian distribution over an observable variable y t with true, but unknown, 
expectation parameter/i and inverse variance (précision) parameterA, Le., 

p(yù--=N{y i ;n,X- 1 ) (5.1) 

Further, we assume that a set of N i.i.d. realizations y{, ... , y^ of the observable variable y t has been obtained. The 
concaténation of this set is denoted as y ■■= (y x , ... ,y N ) T . In this scénario, a classical maximum likelihood point 
estimation approach would resuit in point estimâtes for fi and A -1 based on the sample mean /t and the (biased) 
sample variance â 2 

M = ^Zn=iy n and o 2 = ^Zn=i(yn - M) 2 (5.2) 

respectively. As outlined in the tutorial, based on appropriately chosen prior distributions, the aim of the Bayesian 
paradigm is to obtain posterior probability distributions, which quantify the remaining uncertainty over the true, but 
unknown, unobservable variables given the observable variables and an approximation to the model évidence, i.e. 
the probability of the data given the generative model. Following the convention outlined in the tutorial (Section 3.1) 
"generative model" refers to a joint distribution over the observable and the unobservable variables. Thus in the 
current example it constitutes a joint probability distribution over the observable variable y and both unobservable 
random variables \i and A, A > 0. It is given by 

piy.H.A) = p(ji,A)p(y\[i,A) = p(ji,A) Y\n=iV (ynlM<^) (5-3) 

A possible choice for a prior joint distribution over the unobservable variables is given by the product of a univariate 
Gaussian distribution over fi and a Gamma distribution over A according to 

p(y,H,X) = p{n)p{A) V {y\n,A) = NQi;m IJl ,sfiG&; a À ,b À )Un=iN (y n ; M -1 ) (5.4) 

Note that many other priors are conceivable. In fact, a more commonly discussed scénario (Murphy, 2012; Bishop 
2006) is the case of a non-independent Gaussian-Gamma prior. With respect to the factorized prior considered here, 
this Gaussian-Gamma prior has the advantage that it is "closed-under-sampling" or, in other words, belongs to the 
conjugate-exponential class and allows for the dérivation of an exact analytical solution for the form of the posterior 
distribution. On the other hand, it is not clear in which scénarios a dependency of the prior over the expectation 
parameter fi on the prior over A is in fact a reasonable assumption. We here discuss the factorized (independent 
prior) case, as in our view, it corresponds to the more parsimonious choice of prior, and demonstrates how the 
variational Bayes approach can be used to dérive (variational) posterior distributions in model scénarios where no 
analytical treatment is possible. To this end, we consider a mean-field approximation to the posterior distribution 
over the data conditional unobservable variables, i.e., we set 

pGU|y) * qG0q(A) (5-5) 
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Specifically, we set q(p) to a normal distribution over fi with variational parameters m q and s^ q , and we set q{A) to 
a Gamma distribution over A with variational parameters a q , b q resulting in 

qQi)q(X) = N(x;m q ,s^)G(A;alb q ) (5.6) 

In (5.6), {m^,s^ q ,a^,b^} represents a set of "variational" parameters. For convenience and to keep the notational 
overhead at bay, the values of the variational parameters on the ith itération of the VB algorithm derived below will 
be denoted by m^\s^ l \a£\ and b^\ dropping the "q" superscript. To summarize, we have introduced a set of 
prior parameters {m^s^, a À , b À ) governing the prior distribution p{pL,X) of the generative model in (5.3) and a set 
of variational parameters ^mjp , s^ l \ governing the variational approximation q(fi)q(A) to the posterior 

distribution p(fi,A\y). 

5.2 Application of the variational inference for approximate posteriors 

As discussed in the tutorial the variational inference theorem for mean-field approximations (cf. équation 
(17)) states that the variational distribution over the unobservable variable partition i9 s is given by 

q&s) oc exp(Jq(tV)lnp(y,t9)dtV) (5-7) 
For the current example, we have the mean field approximation 

q (s, A) = q (m) q (A) (5.8) 

and thus 

qQj.) = C M • exp(J q{A) ln p(y, \i, A) dA) (5.9) 

and 

q{A) = C x ■ exp(J qQi)\np(y,n,X)dii) (5.10) 

where and Q dénote proportionality constants that render the proportionality statement in (5.7) équations in 
(5.9) and (5.10). In the following, we dérive an itérative scheme based on the équations above. For this purpose, it is 
helpful (1.) to dénote expectations by the expectation operator 

</O0W) == / p{x)f{x)dx (5.11) 

because it considerably reduces the visual complexity of the expressions involved, (2.) explicitly dénote the itérative 
nature of the approach by indexing the variational distributions q(fï) and q(A) as q^O-O an d I^W, which also 
stresses the fact that in (5.9) and (5.10) above, the left hand side variational distribution refers to the (i + l)th 
itération, while the right hand side variational distribution refers to the ith itération, and (3.), because we are 
dealing with probability density functions of the exponential family, to log transform the expressions above to 
simplify proceedings. Upon thèse notational changes, (5.9) and (5.10) can be re-expressed as 

\nq( i+1 \ii) = (\np(y,n,A)) q&m +1^ (5.12) 
lnq( i+1 \A) == (\np(y,n,A)) q a + i)^ +lnC A (5.13) 
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In the following two sections, we will dérive explicit expressions for the variational parameters m^ +1 \ s^ l+1 ^ and 

a^ +1 \b[ l+1 ^ that govern the variational distributions q^ l+1 \[i) and q^ l+1 \A), respectively. Thèse expressions are 

given in terms of the observable data, as well as the preceeding variational parameters m^\s^ l \a^\ and the 
prior parameters m^, sfi, a^, bx as will be seen below. 

5.3 Variational parameter update-equation for q(fi) 

As noted above, the variational inference theorem for mean-field approximations states that the logarithm 
of the optimal distribution over the unobservable variable \i, ln q( l+1 \[i) , given a fixed distribution over the 
parameter variable, q^(X), is given by 

\nq( i+1 \fi) = {\n V {y,ii,X)) q u {X) +\nC ll (5.14) 

which, if we identify the prior distribution p(fi)p(A) with the initial variational distribution q^(jj.)q^Q) may be 
rewritten as (détails below) 

Inq^Qx) = -\<X> q m w I% =1 (y n - ~~^ + ^ (5.15) 

Based on (5.15) and the completing-the-square theorem (Supplément Section 2), one can infer that the optimal 
variational distribution over the latent variable [i is proportional to a Gaussian 

q^Qi) ce N(n;mj! +1 \ s f +1) ) (5.16) 

where the updated variational parameters are provided in terms of the data y 1( ...,y N and the variational 
parameters of q^(fï) and q^(X) as 

o Dérivation of équation (5.15) 

For a fixed distribution q^ÇX), we have due to the i.i.d. data assumption and the assumption of factorized 
priors for y = y VN 

\nq( i+1 Xii) = (\np(y,n,V) q&m +\nC ll 

= (ln(rin=l P(y n < i"^)))q(0 (A ) + lnC M 

= <ln(rin=l PiVn IM) PGM))>q(0(A) + ln 

= (ln(rin=i P(y n P(M)pW)>q(0 (A) +lnC M 

= <ln(nn=iP(ynlM))>q(0 (A) + <ln p(j"))q(0 (A) + (ln pW) q (O w + ln C M 

(5.18) 

The variational distribution is initialized with the prior distribution 
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P (ji)pW = qW(ji)q«>\X) 



(5.19) 



Substituting the example spécifie probability density functions of p(fï),p(A),p(y n \fi,A), q^Qi) and q^ l \A) from 
(5.4) and (5.6), namely 



PGO : = fas£) 2 exp (~ 0 ~ ™ M ) 2 ) 



(5.20) 



(5.21) 



P(y n Im, ^) : = (^) 2 exp (- 1 (y n - jx) 2 ) 



(5.22) 



q©G0 := ( 2 ^ (0 )"exp ( * - m») 



(5.23) 



,C0. 



* (0 W : = „f!<m. l^ '- 1 exp ( - ^ 



(5.24) 



in (5.18) then yields 



lnqC^GO = (ln(m=i(4) 5 exp(-^(y n -M) 2 )jVo (A) 
+ (ln ((27rs2)"exp (" i^O " "v)')) ) q (0 (A) 

+ <ln (Fè)(^i AaA " lex p(-è))V)a) 

+ lnC M 

= < ^lnA - ^ln 2?r - \XY%=\(y n ~ f0 2 ) q (f){X) 



+ <- ln(r(a A )) - a A ln(6 A ) + (a À - 1) ln A - ±)) q m w 



+ lnC„ 



(5.25) 



Grouping ail constant terms independent of [L (i.e. those terms that do not change if [l changes) with a constant 
and évaluation of the expectation then simplifies the above to 



in^ +1 )( M ) = -\(XYSUi(yn - tf) q u w - 1 2 ( ^ i 5 rL K^w + c - 



2 _ 1 {n-m^f 



2 si 



(5.26) 
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o Dérivation of équations (5.16) and (5.17) 

Starting from 

\nq^(n) = -\{X) q u w Y% =1 (y n - M) 2 ~~^ + (5-27) 

we first note that 

= a N° (5 - 28) 

and that 

ZLiiyn - m) 2 = Z^=i(yn 2 - 2y nf i + m 2 ) = z^iyn 2 - 2^ =1 y n + n^ 2 (5.29) 

Substitution then yields 



2 



= -5(0^° g?. l3 <S - a® fc® 2 M E£ =1 y n + a®*®^ 2 + | M 2 - ^ + ^m 2 ) + 

(5.30) 

Reordering the last expression on the right-hand side in terms of powers of \i, we obtain 

lnqC'^ÛO = ~\{af bf +^ 2 - a® è® 2 M ^ =1 y n - ^ + a® è® l£ =1 y 2 + |m 2 ) + 

= - \ ((a® fc® JV + |) m 2 - ( 2 a® fc® E£ =1 y n + ^) „ + a® fc® I» =1 y 2 + ^m 2 ) + C, 

(5.31) 

Resolving the brackets, and grouping terms independent of fi with the constant C^, resulting in the new constant C^, 
then simplifies the above to 

ln^ +1 V) = -\{$b®N + ^H 2 + (a®è® I» =1 y n + + 4 (5.32) 

Finally, we may re-express the coefficient of y 2 more conveniently by setting 

a®b!>N + 4 = + 4 = (5.33) 

■V -V -V s fi 

and obtain 
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ln^ +1 V) = -\{^^-y + (a? bf Z» n=1 y n + + <?, (5.34) 
Using the completing-the-square theorem in the form 

exp {~\ax 2 + bx) = N(x; aT^b, a -1 ) • C (5.35) 

then yields 

q^X^KN^-m^.sf^) (5.36) 

where the variationa l parameters m^ +1 \ s£ 1+1 may be expressed in terms of the ith variational parameters of 
q^(X), the prior parameters and s 2 and the dat a y n as 



and 

(i+l) 



\ 4 ) ~ ^JM> (537) 

sg / m M +qWflW^ljif =1 y n \ 
1+Nslafbf \ si ) 



1+Nslafbf 



(5.39) 

□ 



5.4 Variational parameter update-equation dérivation for q(A) 



The variational inference theorem for approximated posteriors (5.13) states that the logarithm of the 
optimal distribution over the unobservable variable A, given a momentarily fixed distribution over the unobservable 
variable \i, is obtained by setting (with a constant Q G M) 

\nq^ i+1 \A) == (\np(y,n,A)) q a + i Kll) -flnQ (5.40) 

which, if we identify the prior distribution p(fi)p(A) with the initial variational distribution q^°\x)q^°\A), may be 
rewritten as 

\nq( i+1 \A) =^\nA - f<l£ =1 (y n - /^V^ + (a A - 1) ln A - A + C A (5.41) 

Further, upon évaluation of the expectation and taking the exponential of (5.13), it follows that q^ l+1 \A) is 
proportional to a Gamma distribution 

G(A;4 i+1 \b^) (5.42) 

with parameters 
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a X — 2 



+ a,and := U + - 2 ^ =1 y B mf 1} + N ((mf 1 ^ + S/ f +1) )V) ' (5.43) 



o Dérivation of équation (5.40) 

In analogy to the dérivation of équations (5.15) we have 



ln qO + D U) = (in (n£ =1 (£f e xp (- £ (y n - m) 2 )))^)^ 

(ln ^(2tts2 (i+1) ) 2 exp ("-py (/< - m£ +1) ) 2 ))V i+1) 00 



+ ■ 



+ <ln (Fè)(^I^" lexp (-è))V-)(,) 
+ lnC A 

= <jln A - iln 2tt - ^Zn=i(y n - M) 2 >q(i+D (/i) 



+ <- i ln(27r) - -1ns; - -p^ ) g(t+1)0t) 

+ (- ln(r(a A )) - ln((ô A )^) + Oa - l)lnA - ^)> £? a + D (M) 
+ lnC A 



(5.44) 



Grouping ail constant terms independent of A (i.e. those terms that do not change if A changes) in a constant 
C A £ M. then simplifies the above to 

ln ^ +1 )(A) = <£ln A - £ l£ =1 (y n - m) 2 V+Dqo + <(a A - 1) ln A - + C A 

= jlnA -- 2 (^ =1 (y n - n) 2 ) q ^ w + 0 A - 1) InA - A + C A 

(5.45) 

□ 

o Dérivation of équations (5.41) and (5.42) 

Reorganizing the right hand side of équation (5.40) in multiplicative terms involving A yields 

ln q^\X) = g + a A - l) InA - + i<I^ =1 (y n - m) 2 ) q «+» W ) * + Q 

= g + a A - l) ln A - + \ <Z» =1 yl - 2n E£ =1 y n + N M 2 ) q(i+1)(/t) ) A + C x 

= g + a x - l) ln A - ^ + i (l£ =1 y n 2 -2 E£ =1 y n (M> q « + i) M + N(fi 2 ) q a + ^ w fj A + C À 



(5.46) 
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where the expectations may be expressed in terms of the variational parameters of q( l+1 \[i) according to 




2 (i+D 



)) 



A+C x (5.47) 



Taking the exponential on both sides then yields 




(5.48) 



Up to a normalization constant C x , q ( - l+1 \A') is thus given by a Gamma distribution with 




(5.49) 



where 
and 




(5.50) 




-i 



(i+i) 



(5.51) 



□ 



5.6 Evaluation of the variational free energy 

While the maximization of the variational free energy has been exploited to dérive the variational parameter update 
équations, the value of the variational free energy has not yet been directly evaluated. To obtain a value of this 
(implicit) target function to monitor the évolution of the algorithm, and, more importantly, to obtain an 
approximation to the marginal likelihood upon convergence, it is désirable to evaluate the variational free energy 
intégral (cf. équation (13) of the tutorial). We first reformulate the variational free energy as follows (Penny, 2000) 



HiW) = J in(p(y|0)) M ~ f ln (m) d $ ■■= £av{p(y\v),q(v)) - 3CC(g(0)||p(0)) (5-52) 



The first term in the expression on the right hand side of (5.53) is sometimes referred to as the "average likelihood" 
and the second term is the KL-divergence between the variational and prior distributions. 

o Dérivation of (5.52) 

By définition of the variational free energy, we have 



F(q(*)) = Jq(*)ln(^)d* 




Jç(t9)ln(p(y|t9))-ln(^)cW 

/ q{d) ln(p(y|t9)) dd - / q{d) ln (^) dd 
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= : L m (p<y\n qW) - X£(q(fi)\\p(fi)) 

(5.53) 

□ 

Upon this reformulation the respective terms comprising the variational free energy may then be evaluated in turn 
as shown below, resulting in 

T (V%)q«a)) := L av (p(y\0\q®(li)q®ai) " X£(q«00q«(A)||pG0p(A)) (5.54) 

where 

La* (p(y|t9),q«(^)q«a)) = ^(af ) + K&f)) - Jaffef (^ =1 y„ 2 + N ((m«) 2 + sf) - 2mf S =1 y„) ( 5.55) 

and 

XLCq®Qi)qW(X)\\pQi)pW) = KX(<? (0 G0l|pG0) + JCC(q«(A)||p(A)) (5.56) 

with 

JC£(q»(A) | |p(D) = (èf - l) ip (èf) - In af - fc» - log r (èf) + ln rfe) + b x ln a, - (fc - 1) (ftf ) + ln a») + ^ 

(5.57) 

We thus obtain an expression for the variational free energy as function of the observable data y x , ...,y W/ the 
variational parameters s^ l \ and the prior parameters m M ,5^,a A and 

o Dérivation of (5.54) 

We first recall the définitions of p(y|i9),p(i9) and q(i9) for the current example: 

p(y|0) = P(yl/U) = n^=iN(y n ;p,A) (5.58) 
p(t9) = p(p,A) =:p(p)p(A) = N(p;m M ,s2)G(A;a A ,6 A ) (5.59) 

and 

q(jB) = q(n,X) =:q(n)qW = N \n;mf ',s* W )g [X;af >f ) (5.60) 
(a) Evaluation of £ av (p(y|tf), q«(p)q (0 (A))in (5.55) 
Substitution of (5.58) and (5.59) in (5.54) yields, 

Sq{d)\rv(p(y\dj)dd= JJ <^(p)^(A) ln (X[» =1 N (y n ;p,A)) dpdA 

= JJ q (0 (pV° (A) ln m=i exp (- * (y n - p) 2 )) dpdA 
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= JJV°GO<ï (0 (A) gin A - iln 2n - ^ =1 (y n - pi) 2 ) dpidX 

= |// q®Qi)q®W On A) d/^dA - JJ g® Qi)q® (A) gz^i(y n - M) 2 ) dpdX - \\n lu 
= \l g® (A) ln A dA - / g «(A) * (J g® GOŒJUGfe _ M )2) dA _ i ln 2n 

(5.61) 

The first intégral term above is the expectation of a logarithm variable A under the variational Gamma distribution 
G [à.; bj^) which can be shown to evaluate to 

J q © (A) ln A dA = (af ) + ln 6 f 

(5.62) 
where 

i/):1-»1,ïh iK x ) := ^ T(x) (5.63) 
dénotes the di-gamma function (see also 2.34). The second intégral term in (5.61) évaluâtes to 

/q®(A)iA(/q®Oi)Œ^=i(^ -n) 2 )dfi)dA = J q®(X) Qa) (J <? (0 G0Œn=iyn " 2*x + N/i 2 ) d M ) dA 

= i J (A) A(Z^ =1 y n 2 - 2 E£ =1 y n fq®(pi)iidn + Nf q^(ji)n 2 dn) dA 

(5.64) 

(b) Evaluation of X£(q(i9)||p(i9)) in (5.54) 

We first note that we have, omitting superscripts for ease of notation, 

XL(qQi)qW\\pQi)pW) = j q(jx)q (A) ln d0A 

= Jq( ^(A)(ln^ + ln^)d,dA 

= J^)(ln^)d, + J,(A)(lnH|)dA 
= X£(qG0|| P G0) +X£(q(A)||p(A)) 

(5.65) 

In the current example, the variable fi is governed by Gaussian distribution both for the prior and in the variational 
approximation to the posterior distribution. The variational distribution q(/i) corresponds to the ith variational 
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Gaussian distribution over/i and has parameters m^\s^ 1 whereas the Gaussian prior distribution over/i has 
parameters m^, s£ . We thus have 

K£(q(n)\\pQi)) =K£(q<f>(n)\\p(nj) =X£(N[ii;m^,s^ 0 )\\N^;m ll ,s^j (5.66) 

As noted in (Penny, 2000), the KL divergence for two Gaussian distributions is a function of the respective 
distribution parameters and given for the current example as 

KL ( N („; mf, sf) | \N(« m,, sj>j) = \Xn Jfr + V™f~f _ i (5 6?) 

Finally, for the current example, the variable A is governed by a Gamma distribution both for the prior and in the 
variational approximation to the posterior distribution. The variational distribution q(X) corresponds to the ith 

variational Gamma distribution overA and has parameters a^\b[ l \ whereas the Gamma prior distribution overA 
has parameters a x , b À ■ Hence, we have 

XL(qa)\\p(ti) =5Cx(q (t) (A)||p(A)) = JCC (g (A; af ,bf)\\G{X;aM) (5.68) 

As noted in (Penny, 2000), the KL divergence for two Gamma distributions is a function of the respective Gamma 
distribution parameters and given for the current example as 

KL {G(X; af, b?)\\G{X; a À , bj) = (fcf - l)^(èf ) - ln af - fcf - lo g r(ôf ) + ln r(i»J + b A ln a x - (b À - l)(^(ôf ) + ln af) + 

(5.69) 

Concatenating the results from (a) and (b) then results in expressions (5.55) for the variational free energy. 

□ 

The VB algorithm derived in this example is shown in pseudocode form in Table 3. 
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% Initialization of the variational parameters and variational free energy 



(o) 

-2(0) . = ç 2 
(0) 



F<°> = T (y, W, m<°>, sf\ af, bf, m„, sj, a h b,) 
%VB itérations 

for i = 0,1, 2, ...until convergence do 
% update of q(pi) 

(i+i) _ m M +44°4°£«=iyn 



S 2 



(i+1) . s£ 



% update of q(A) 

2 



% Free Energy Evaluation 

F(^) := T (y, iV, m£ 0) , sf\ af , , m M , s 2 , a A , ô A ) 




end 



% VB posterior unobservable variable distribution and log model évidence approximation 

pGUbO™ :=N(,;mï +1 \sf +1) ).G(A;at 1 \br i) ) 
\np(y) VB :=F^ 



Table 3. Pseudocode for the VB Algorithm for the univariate Gaussian. 
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5.7 Matlab implementation 

The following Matlab code implementing the pseudocode of Table 3 was used to generate Figure 5 of the tutorial. 



function vb_lgssm_l 

% This function demonstrates the variational Bayesian algorithm for the 
% example of inferring the mean and précision (inverse variance) of a 
% univariate Gaussian distribution as discussed in the tutorial section 
% 3.2 and supplementary section 5. 

% Copyright (C) Dirk Ostwald 



clc 

close ail 



% Random number generator settings 

rng ( ' shuf f le ' ) 

% Parameter Space Initialization 



% specify observed variable y space 
ymax = 4 
ymin = -4 
yres = 100; 

yspace = linspace (ymin, ymax, yres) ; 

% specify unobserved variable mu space 
mumax = 2 ; 
mumin = -2; 
mures = 100; 

muspace = linspace (mumin, mumax, mures); 

% specify parameter variable lambda space 
lmax = 7 ; 
1min = 0.001; 
1res = 100 ; 

lspace = linspace (lmax, 1min, 1res) ; 



% Spécification of the true, but unknown, distribution 

% true, but unknown, value of the latent variable x 

mu_tbu = 1 ; 

% true, but unknown, value of the parameter variable lambda 

l_tbu = 5 ; 

% true, but unknown, underlying model 

y_tbu = ( ( (l_tbu) / (2*pi) ) * (1/2) ) . *exp ( (-1 tbu/2) * ( (yspace-mu_tbu) . "2) ) , 



IID sampling from the true, but unknown, model 



% number of observed variable data points 
N =10; 

% i.i.d. sampling 
yn = NaN (N, 1) ; 
for i = 1:N 

yn(i) = mu_tbu + sqrt ( 1 /l_tbu) *randn; 

end 



Variational Bayes algorithm initialization 



% generative model prior distribution 
m_mu_prior = 0; 



s mu prior 

a_l_prior 

b_l_prior 



% maximum number of itérations 
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maxlter 



= 8; 



% initialization of parameter, variational distribution and FE arrays 



m_mu = NaN (maxlter+l , 1 ) 

s mu = NaN (maxlter+l , 1 ) 

a_l = NaN (maxlter+l, 1) 

b_l = NaN (maxlter+l, 1) 

vFE it = NaN (maxlter+l, 1) 



% expectation of Gaussian over x 

% variance of Gaussian over x 

% shape of Gamma over 1 

% scale of Gamma over 1 

% variational free energy 



qmuql = NaN (mures , 1res , maxlter+l ) ; % q(mu)q(l) on muspace x lspace support 

% initialization of the variational distribution parameters 
m_mu(l) = mmuprior; 

s_mu(l) = s_mu_prior; 

a_l(l) = a_l_prior; 

b_l (1) = b_l_prior; 

% evaluate the factorized Gaussian-Gamma distribution over x and lambda 
qmuql (:,:,1) = pdf _qmuql (muspace, lspace, m_mu(l), s_mu(l), a_l(l), b_l(l)); 

% evaluate the expectations of the marginal distributions and the 

% variational estimate of the true, but unknown, model based on thèse 

% expectations 

exp_mu_qmu ( 1 ) = m_mu ( 1 ) ; 

expJL_ql(l) = a_l (1) *b_l (1) ; 

exp_y_qmul ( 1 , : ) = ( ( ( exp_l_ql ) / ( 2 *pi ) ) * ( 1 12 ) ) . *exp ( ( -exp_l_ql/2 ) * ( (yspace-exp_mu_qmu ). A 2)); 



% data statistics 

y_bar 

yy_bar 



= sum (yn, 1) ; 
= sum ( yn . "2 , 1 ) ; 



% evaluate the variational free energy upon initialization 

vFE_it(l) = vFE(N,y_bar, yy_bar, m_mu(l), s_mu(l), m_mu(l), s_mu(l), a_l(l), b_l ( 1 ) , a_l ( 1 ) , b_l(l)); 



Variational Bayesian algorithm itérations 



for i = 1 :maxlter 



% q(\mu) update 



m_mu(i+l) = (m_mu_prior + ( s_mu_prior *a_l ( i ) *b_l ( i ) *y_bar ) ) / ( 1 + (N*s_mu_prior*a_l (i) *b_l (i) ) ) ; 

s_mu(i+l) = s_mu_prior/ ( 1 + (N*s_mu_prior*a_l (i) *b_l (i) ) ) ; 

% evaluate the factorized Gaussian-Gamma after update of q(x) 

qmuql (:,:, 2*i) = pdf _qmuql (muspace, lspace, m_mu(i+l), s_mu(i+l), a_l(i), b_l(i)); 

% evaluate the expectations of the marginal distributions and the 

% variational estimate of the true, but unknown, model based on thèse 

% expectations 

exp_mu_qmu = m_mu(i+l); 

expJL_ql = a_l (i) *b_l (i) ; 

exp_y_qmul (2*i, : ) = ( ( (exp_l_ql ) / ( 2 *pi ) ) * ( 1 12 ) ) . *exp ( ( -exp_l_ql/2 ) * ( ( yspace-exp_mu_qmu ).~2)); 

% q(Mambda) update 



a_l(i+l) = (N/2) + a_l_prior; 

b_l(i+l) = 1/ (l/b_l_prior + ( (1/2) * (yy_bar - ( 2 *m_mu ( i+1 ) *y_bar ) + (N* (m_mu ( i+1 ) "2 + s_mu ( i+1 )))))) ; 

% evaluate the factorized Gaussian-Gamma after update of q(x) 

qmuql (:,:, 2*i+l ) = pdf _qmuql (muspace, lspace, m_mu (i+1 ) , s_mu(i+l), a_l(i+l), b_l(i+l)); 

% evaluate the expectations of the marginal distributions and the 

% variational estimate of the true, but unknown, model based on thèse 

% expectations 

exp_mu_qmu = m_mu(i+l); 

exp^l_ql = a_l (i+1) *b_l (i+1) ; 

exp_y_qmul (2*i+l, : ) = ( ( (exp_l_ql) / (2*pi) ) " (1/2) ) . *exp ( (-exp_l_ql/2 ) * ( (yspace-exp_mu_qmu) . "2) ) ; 

% évaluation of the variational free energy 



vFE_it(i+l) = vFE(N,y_bar, yy_bar, m_mu(i+l), s_mu(i+l), m_mu_prior, s_mu_prior, a_l(i+l), 
b_l ( i+1 ) , a_l_prior , b_l_prior) ; 



end 



% Resuit plotting 

h = figure; 

set(h, 'Color', [1 1 1]) 

% initialization 
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subplot (3, 5, 1) 
hold on 

plot(yspace, y_tbu , ' k- 1 , 'LineWidth', 2) 

plot(yspace, exp_y_qmul ( 1 , : ) , 'k:' , 'LineWidth', 2) 

plot(yn, zeros(N,l) , ' ro ' , ' MarkerSize ' , 6, ' MarkerFaceColor ' , 'r', ' MarkerEdgeColor ' , 'r') 

axis square 

set(gca, 'FontSize', 16, 'LineWidth', 2, ' FontName ' , 'Times New Roman') 

subplot (3,5,6) 
hold on 

contourf (muspace, lspace, qmuql ( : , : , 1 ) ' ) 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'YDir', ' Normal ',' FontName ' , 'Times New Roman') 
plot (mu_tbu, l_tbu, 'wo', 'MarkerSize', 6, 'MarkerFaceColor', 'w') 
axis square 

% first itération 
subplot (3,5,2) 
hold on 

plot(yspace, y_tbu , ' k- 1 , 'LineWidth', 2) 

plot(yspace, exp_y_qmul (2, : ) , 'k:' , 'LineWidth', 2) 

plot(yn, zeros(N,l) , ' ro ' , 'MarkerSize', 6, 'MarkerFaceColor', 'r', 'MarkerEdgeColor', 'r') 

axis square 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'FontName', 'Times New Roman') 

subplot (3, 5, 7) 
hold on 

contourf (muspace, lspace, qmuql ( : , : , 2 ) ' ) 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'YDir', ' Normal ',' FontName ' , 'Times New Roman') 
plot (mu_tbu, l_tbu, 'wo', 'MarkerSize', 6, 'MarkerFaceColor', 'w') 
axis square 

subplot (3,5,3) 
hold on 

plot(yspace, y_tbu , ' k- ' , 'LineWidth', 2) 

plot(yspace, exp_y_qmul (3, : ) , 'k:' , 'LineWidth', 2) 

plot(yn, zéros (N, 1) , ' ro ' , 'MarkerSize', 6, 'MarkerFaceColor', 'r', 'MarkerEdgeColor', 'r') 

axis square 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'FontName', 'Times New Roman') 

subplot (3, 5, 8) 
hold on 

contourf (muspace, lspace, qmuql ( : , : , 3) ' ) 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'YDir', ' Normal ',' FontName ' , 'Times New Roman') 
plot (mu_tbu, l_tbu, 'wo', 'MarkerSize', 6, 'MarkerFaceColor', 'w') 
axis square 

% last itération 
subplot (3, 5, 4) 
hold on 

plot(yspace, y_tbu , ' k- ' , 'LineWidth', 2) 

plot(yspace, exp_y_qmul (end-1 , : ) , 'k:' , 'LineWidth', 2) 

plot(yn, zéros (N, 1) , ' ro ' , 'MarkerSize', 6, 'MarkerFaceColor', 'r', 'MarkerEdgeColor', 'r') 

axis square 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'FontName', 'Times New Roman') 

subplot (3, 5, 9) 
hold on 

contourf (muspace, lspace, qmuql (:,:, end-1 )' ) 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'YDir', ' Normal ',' FontName ' , 'Times New Roman') 
plot (mu_tbu, l_tbu, 'wo', 'MarkerSize', 6, 'MarkerFaceColor', 'w') 
axis square 

subplot (3,5,5) 
hold on 

plot(yspace, y_tbu , ' k- ' , 'LineWidth', 2) 

plot(yspace, exp_y_qmul (end, : ) , 'k:' , 'LineWidth', 2) 

plot(yn, zéros (N, 1) , ' ro ' , 'MarkerSize', 6, 'MarkerFaceColor', 'r', 'MarkerEdgeColor', 'r') 

axis square 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'FontName', 'Times New Roman') 

subplot (3, 5, 10) 
hold on 

contourf (muspace, lspace, qmuql (:,:, end) ' ) 

set(gca, 'FontSize', 16, 'LineWidth', 2, 'YDir', ' Normal 1 ,' FontName ' , 'Times New Roman') 
plot (mu_tbu, l_tbu, 'wo', 'MarkerSize', 6, 'MarkerFaceColor', 'w') 
axis square 

maximize (h) 

subplot (3, 5, [11 : 15] ) 

plot ( [0 :maxlter] , vFE_it ( 1 : end) , 'ko-', 'MarkerFaceColor', ' k ',' MarkerSize ' , 8, 'LineWidth', 2) 
set(gca, 'FontSize', 24, 'LineWidth', 2, 'FontName', 'Times New Roman') 
xlim([0 maxlter] ) 
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box off 



end 

function [qmuql] 



pdf _qmuql (muspace , lspace, m_mu, s_mu, a_l, b_l) 



% This function évaluâtes the factorized variational probability density 
% function over a Gaussian distributed variable mu and a Gamma distributed 
% variable lambda on the support muspace x lspace according to 



p(mu,l; m mu, s_mu, a l, b_l) 



N(mu;m mu, s mu)*G(l;a l,b 1) 



Inputs 



muspace 

lspace 

m_mu 

s_mu 

a_l 

b 1 



support of variable mu (u) 
support of variable 1 (lambda) 
scalar expectation parameter of N(mu;m_mu 
scalar variance parameter of N(mu;m_mu 
scalar shape parameter of G(l;a_l,b 

scalar scale parameter of G(l;a_l 



mu) 
mu) 



_1) 
b 1) 



Outputs 



qmuql 



p(mu,l; m mu , s_mu , a_l , b_l) 



% Copyright (C) Dirk Ostwald 



% evaluate the factorized distribution q(mu)q(l) 
qmuql = NaN (length (muspace) , length ( lspace )) ; 
for x = 1 : length (muspace ) 

for 1 = 1 : length (lspace) 

qmuql (x,l) = ( 1 / ( sqrt ( 2 *pi*s_mu) ) ) *exp ( ( -1 / ( 2 ' 
( (1/gamma (a_l) ) * ( (lspace (1) * (a_l- 



s_mu) ) 
1) ) / (b 



' (muspace (x) - m_mu) "2 ) * . . . 
_l A a_l) ) *exp (-lspace (1) /b_l) ) 



end 



end 
end 



function [vFE] 



vFE(N,y_bar, yy_bar, m_mu, s_mu, m mu 0, s_mu_0, al, b l,a 1 0, b_l_0) 



% This function évaluâtes the variational free energy as an apprxoximation 
% to the log marginal likelihood (log model évidence) of the example as 
% discussed in supplément section 6. 



% Inputs 



N 

y_bar 

yy_bar 

m_mu 

s_mu 

m_mu_0 

s_mu_0 

a_l 

b_l 

a_l_0 

b 1 0 



number of data points 

sum of data points y_n 

sum of squared data points y_n~2 

ith expectation parameter of unobserved variable mu variational distribution 
ith variance parameter of of unobserved variable mu variational distribution 
expectation parameter initialization of unobserved variable mu variational distribution 
expectation parameter initialization of unobserved variable mu variational distribution 
ith shape parameter of parameter variable \lambda variational distribution 
ith scale parameter of paramter variable \lambda variational distribution 
shape parameter initialization of parameter variable lambda variational distribution 
scale parameter initialization of parameter variable lambda variational distribution 



% Outputs 

% vFE : variational free energy 

% Copyright (C) Dirk Ostwald 



% termwise évaluation of the average energy 

l_av = (1/2) * (psi (a_l) + log(b_l)) - ( (1/2) *a_l*b_l* (yyjoar + N* (m_mu"2 + s_mu) 



(2*m_mu*y_bar) ) ; 



% évaluation of the latent variable KL divergence 
KL_q_mu = spm_kl_normal (m_mu, s_mu, m_mu_0 , s_mu_0 ) ; 

% évaluation of the parameter KL divergence 
KL_q_l = spm_kl_gamma (b_l , a_l , b_l_0 , a_l_0 ) ; 

% concatenate the négative variational free energy 
vFE = - ( l_av - KL_q_mu - KL_q_l ) ; 

end 



function [d] = spm_kl_gamma (b_q, c_q, b_p, c_p) 

% KL divergence between two Gamma densities 
% FORMAT [d] = spm_kl_gamma (b_q, c_q, b_p, c_p) 

% KL (Q| |P) = <log Q/P> where avg is wrt Q 

% b_q, c_q Parameters of first Gamma density 
% b_p, c_p Parameters of second Gamma density 
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% Copyright (C) 2008 Wellcome Trust Centre for Neuroimaging 
% Will Penny 

% $Id: spm_kl_gamma . m 2696 2009-02-05 20:29:48Z guillaume $ 
digamma_c_q = psi(c_q); 

d = (c_q-l) *digamma_c_q-log (b_q) -c_q-gammaln (c_q) ; 

d = d+gammaln (c_p) +c_p*log (b_p) - (c_p-l) * (digamma_c_q+log (b_q) ) ; 

d = d+b_q*c_q/b_p; 

end 

function [d] = spm_kl_normal (m_q, c_q, m_p, c_p) 

% KL divergence between two multivariate normal densities 
% FORMAT [d] = spm_kl_normal (m_q, c_q,m_p, c_p) 

% KL (Q| |P) = <log Q/P> where avg is wrt Q 

% between two Normal densities Q and P 

% m_q, c_q Mean and covariance of first Normal density 
% m_p, c_p Mean and covariance of second Normal density 



% Copyright (C) 2008 Wellcome Trust Centre for Neuroimaging 
% Will Penny 

% $Id: spm_kl_normal .m 2696 2009-02-05 20:29:48Z guillaume $ 

d=length (m_q) ; 
m_q=m_q ( : ) ; 
m_p=m_p ( : ) ; 

Terml=0 . 5*spm_logdet (c_p) -0 . 5*spm_logdet (c_q) ; 
inv c p=inv(c p) ; 

Term2=0 .5*trace (inv_c_p*c_q) +0 .5* (m_q-m_p) ' *inv_c_p* (m_q-m_p) ; 

d=Terml+Term2-0 . 5*d; 

end 

function [H] = spm_logdet (C) 

% returns the log of the déterminant of positive semi-def inite matrix C 
% FORMAT [H] = spm_logdet (C) 
% H = log (det (C) ) 

% spm_logdet is a computationally efficient operator that can deal with 
% sparse matrices 



% Copyright (C) 2008 Wellcome Trust Centre for Neuroimaging 
% Karl Friston 

% $Id: spmJLogdet.m 4068 2010-09-07 16:50:08Z ged $ 
sw = warning ( ' of f ' , ' MATLAB : log : logOf Zéro ' ) ; 
% assume diagonal form 



TOL = le-16; % cf. n*max(s)*eps 

n = length (C) ; 

s = diag (C) ; 

i = find(s > TOL S s < 1/TOL); 

C = C ( i , i ) ; 

H = sum (log (diag (C) ) ) ; 

% invoke det if non-diagonal 



[i j] = find(C) ; 
if any(i ~= j) 

n = length (C) ; 

a = exp (H/n) ; 

H = H + log (det (C/a) ) ; 

end 

% invoke svd if rank déficient 



if -isreal(H) || isinf(H) 
s = svd (full (C) ) ; 

H = sum (log (s (s > TOL & s < 1/TOL))); 

end 

warning (sw) 



52 



end 



function maximize ( f ig) 

% This function maximizes the figure with handle fig 

% Input 

% fig : figure handle of figure to be maximized 

% Original author: Bill Finger, Creare Inc. 

% Free for redistribution as long as crédit comments are preserved. 

% Copyright (C) Dirk Ostwald 



if nargin == 0 
fig = gcf; 

end 

units = get ( f ig, ' units ' ) ; 

set(fig, ' units ' , ' normal ized ' , ' outerposition ' , [ 0 0 1 1 ] ) ; 
set(fig, 'units', units) ; 

end 
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6 Mathematical détails of the tutorial example 
6.1 VB for the univariate LGSSM 

As outlined in the tutorial, we assume the following generative model 

p(yi:T'X 1: T,a,À x ,b,À y ) = p(y 1:T ,x 1:T \ a,À x ,b,À y )p(a,À x ,b,À y ) (6.1) 
where we define the following (conditional) independence properties 

p(a, À x , b, À y ) := p(a)p(A x )p(b)p(A y ) (6.2) 

and 

P(yi:r<*i:rl a,À x ,b,À y ) ■=p{x 1 )Y\\ =2 p{x t \x t _ 1 ,a,A. x )Y\ T t=iP(yt\xt,b,À.y) (6.3) 

The marginal or prior probability density functions on the right-hand side of (6.2) are assumed to be given as 

p(a) := N(a; n a , <r 2 ) = {InolY* exp (- ^ ( a ~ Ma) 2 ) (6-4) 
p{X x ) == G{X x ;a Xx ,b Xx ) = ^^A/^ 1 exp (- (6.5) 

p(b) :=N{b;n b ,ol) = (2noïyï exp (-^(6 - Mb ) 2 ) (6.6) 

PM == G (A y , % , hy ) = ^-^^ exp (- (6.7) 

The marginal or prior distribution and the conditional distributions on the right-hand side of (6.3) are assumed to be 
given by 

i 

pOi) : = Nfa.n^Axl) = (-^f exp -fej 2 ) (6-8) 

i 

p(x t \x t _ 1 ,a,A x ) := NO^ax^,/^ 1 ) = (^) 2 exp (-y (x t - ax^) 2 ) fort = 2, ...,T (6.9) 

i 

p(y t |x t ,M y ) :=N(x t ;6x t ,A- 1 ) = (^) 2 exp(-^(y t -6x t ) 2 ) for t = 1 T (6.10) 



The set of known prior parameters is thus given by 

n •■= {Ma.^,a Àx ,b Ax ,n b ,ai,a Xy ,b Xy ,n Xl ,À Xl } (6.11) 

We assume the following factorized form of the variational approximation to the posterior distribution over the 
unobserved variables 

p{a,X x ,b,X y ,x VT \y VT ) « q(a)q(A x )q(b)q(A y )q(x 1:T ) (6.12) 

According to the variational Bayes theorem for factorized posteriors (équation (17) of the tutorial), the variational 
free energy is maximized by setting 

q(tft)ocexp(/(7(tf v )ln P (y,tf)d^ t ) (6.13) 
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For the current example, we have 

y := y VT andi9 := {a,A x ,b,A y ,x 1:T } (6.14) 
where the variational distribution over unobserved variables partitions according to 

q(fi) ~ q(a)q(A x )q(b)q(A y )q(x 1:T ) (6.15) 
We thus obtain the following update équations, using the short hand notation 

(/(*)>?(*) : = / fix)p(pc)dx (6.16) 

for expectations: 

q^ +1 )(a) a exp ((lnp(y 1:T ,x 1:T ,a,^,6,A y )) (? (o (A;c)£? (o W q(0 (Ay)£? (o (Xl:T )) (6.17) 

q( i+1 \A x ) a exp {{\np(y 1:T ,x 1:T , a,A x , b, A y )> ( ,a + i) (a)? (o (6)? (0 (Ay)? (o (3ei:I . ) ) (6-18) 

« exp ((lnp(y 1:T ,x 1:T ,a,A x ,6,A y )> £?(i+1)(a)£?a+1 ) (A;c)£?( o (Ay)£? (o (Xi:T) ) (6.19) 

q a+1) 0 y ) « exp (<lnp(yi:r,x 1:r ,a,A x ,M y )) q(i+ i) (a)£? ( i+ i) (A;e)£? ( i+ i) W£?(i ) (%ir) ) (6.20) 

9 (<+1) (*l:T) K eX P (( ln P(yi:T^l:T<a<^^^y))qa+i) (a) qa+i) ( A ;c )q(^i)(ft)q«+i)(A y )) ( 6 " 21 ) 

For (6.17) - (6.20) we define the following variational distributions for VB itérations i = 0,1,2, ... 

q© (a) := N(a; M « a a 2(0 ) = (ina^f' exp (- (a - M «) 2 ) (6.22) 

:= G (^;a«fc®) = * ^V^expf-^) (6.23) 

q<Hb) :=N{b;ti\oî {i) ) = (ina^f exp (- ^ (* - M?)') (6.24) 
:= G(A y ;a« ô«) = ^_^ T A y a S- 1 exp(-^) (6.25) 

Ay 

Inference of q ( - l+1 - ) (x 1:r ) is discussed in détail in Section 6.3 of the supplementary material. Note that the (known) 
prior parameters of (6.11) and the (to be evaluated) parameters of the variational distributions over unobserved 
variables are distinguished by the (i) superscript 

6.2 Evaluation of q{d)q{A x )q{b)q{A y ) 

To dérive the update équations for the parameters governing the variational distributions of the unobserved 
variables a,A x ,b and A yi we take the following approach: We first reformulate the expectation of the joint 
distribution of the LGSSM and parameters, p(yi :T , x 1:T , a, A x , b, Ay), under the variational mean field approximation 
<7^(a)<7W(/l x )(7(0(ô)q( m )( / l y )q( n )(x 1:T ). We can then in turn dérive the update équation for each variational 
distribution (6.22) - (6.25) based on the variational Bayes theorem by ignoring its own contribution to the 
expectation. In the évaluation of the variational expectation of p(yi : T,x 1:T , a, A x , b, Ay) we use the generic 
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superscripts h,j, k,l,m £ M in lieu of the step-wise updated superscripts i and i + 1 to dérive the gênerai resuit and 
subsequently replace h,j, k, l, m by the appropriate itérative indices. 



o Evaluation of the variational expectation of p{y\-T, 
We have by définition 

<lnp(y 1:Tj x 1:T , a, A x , b, Ay)) q U) (a)q m (Ax)q (i) (b)q (m) (Ay)q (n) (Xi ^ (6.26) 

= On p(y 1:T , x 1:T \a, À x , b, À y )p(a, À x , b, ^ y )) q (i) (a)q (k) (Àx)q (i) (b)q (m) (ly)q (n) (x ^ 
Under the assumption of a factorized prior distribution 

p(a, À x , b, À y ) = p(d)p(A x )p(b)p(A y ) (6.27) 

and with the likelihood 

Piyv.T x 1:T \a, À x , b, À y ) = p(x x ) rit=2 P(x t \x t _ v a, À x ) ULi p(Jt \ x t> b, A y ) (6.28) 

we obtain 

(lnp(y 1:T , x 1:T , a, À x , b, ^ y )) q (j) (a)q w (Àx)q (o (b)q (m) (Ày)q (n) (x ^ 
= (lnp(x 1 )) £? (n )(Xi) 

+ St=2<lnp(x t |x t _ x , a, * x )) q (j) (a)q m ax)q (n) (Xt _ i:t) 
+ Zt=i(ln p(y t \x t , b, Ay)) q u) (b)q ( k)(Ày)q (n) (Xt) 

+<lnp(a)) q0 -) (a) 
+(\np(A x )) q(k)(Àx) 
+<lnp(è)) q(i ) (b) 
+(lnp(Ay)) q(m)(Ày) 

(6.29) 

Next, substituting the respective forms for the probability density functions, we obtain 

(ln p(y 1:T , x 1:T , a, À x , b, Ay))^;)^^^)^^^^^™)^^™)^^^ 

= ^(fe)'-p(-¥(^-^) 2 ))v")(, 1 ) 

+ 2X2 (ln (g) 2 exp (- À f (x t - ax,^) 2 )))^;)^),^)^),^)^^) 

+ 2X1 (ln (^£f exp (- \ (y t - ôx t ) 2 )^ (ft ) (Xt)£? y )(b)£?(fc ) (Ay) 
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+ <ln fëj K^ /l/i '" 1 exp ( _ %)) > « w ^> 

(6.30) 

Evaluation of the logarithmic expressions then yields 

(ln p(y 1:r< X 1:T , a, 6, ^y)>qO) (a) q(fc) (A;c) q(i) (&) q( m ) (Ay) q(n) (%i:T) 
= ï ln fe)-ï<^i(^-^i)V ) &Ci) 

+ Ê) V^) - i2F=iUy(y t - to ^ 2 ),a) W£? ( fc ) (Ay)£? (n )(;Ct) 

-ln (r(a A j) - ln((ô A J aA *) + (a Ax - l^lnA*)^^ - 
- (^2 (.b - H b ) 2 ) q m ib) - \ln 2nol 

-ln (r (a Ay )) - ln ((ô Ay ) * y ) + (a Ay - l) (lnA y > q(m)(Ay) - ^<A y > q(m)(Ay) 

(6.31) 

Finally évaluation of the quadratic terms yields 

(ln p(y 1:T , X 1:T , a, À x , b, ^)) q U)(a)q»H^)q^b)q im \^)q w 0ci^ 

+ 1 < ln Ê) V fc) (A,) - I <^cŒL 2 * 2 - 2a Z[=2 *t-i*t + a 2 Z[= 2 ^-i)) q O0 (a)£? (« (A;e)q (n) (;Ct _ 1:t) 

+ ^ ln Ê)V m) (A y ) " ï<*yŒt=iyt - 26 2t=iyt«t + ^ 2 2t=i*t)> (ï a)( i ,) (l e»)(A y ) (I c»)0c t ) 
- ^|(a 2 - 2a/i a + Ma 2 )qO) (a) - £ln 27tct 2 

-ln (r(a A j) - a Ax Info,) + (a Ax - l)(lnA x ) q(fc)(A;e) - ^U%>q« (A;e ) 
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- ^| <b 2 - 2bix b + Lib 2 ) q m (b) ~ ^ln 2naj; 

- ln (r (a Ay )) - a Ay ln(ô Ay ) + (a Ay - l) (\nA y ) q(m)(Ày) - -^^y) q m^ 

(6.32) 

□ 

o Application of the Variational Bayes Inference Theorem 

Based on (6.32), we now apply 

p(0 t ) a exp(J"q(0 v ) ln p(y,0)d0 v ) = exp (<ln(p(y,i9))> q(l9v) ) (6.33) 

for each variational parameter distribution q{a), q(A x ), q(b), and q(A y ). Therefore, starting with (6.32), we ignore 
the expectations under the respective variational distribution and we subsume ail terms independent of the variable 
under considération in the proportionality constant. 

o Dérivation of an update équation for q(a) 

Setting k, n = i we have 

lnq (i+1) (a) « - \{I,t=2^xXt - 2aJ 1 J =2 A x x t _ 1 x t + a 2 I l J=2^ x xf-i) q &^ x)q & <ixt _ 1:t) ~ ^| (a 2 ~ 2afi a + ni) (6.34) 
Partitioning the expectations, we obtain 

lnq(a) oc - i \J,J =2 (À x xf ) q Hx x)q V)( xù - 2aJ3 =2 (A x x t - 1 Xt) q i (Mq m iXt _ 1:û + a 2 ELzUx^t-i),^),»^)) - (« 2 - 2a/i a + tâ) 

(6.35) 

Rewriting the above as a quadratic form in a and ignoring terms independent of a yields 

ln q{a) oc - \ (^U^U)^^^ + ^) a 2 + (SL 2 < Wi^ce^e^) + g) a (6.36) 

Using 

U^t-iV)^®^) = <^>«,C0(A,)^t-i> (I C0 (Xt _ l) = a£V°<*t-iW t -i) (6-37) 

and 

(^-AVo^M = <^>q®^)^t-i^)q©fe_ 1:t ) = 4l h ^^t-i^t) q ^ iXt _ vt) (6.38) 
évaluation of the remaining expectations in terms of the parameters of q^(A x ) then yields 

ln q(a) oc - \ [afb.f ft^U) q v M + ^) a 2 + (a®*,® EL^-^)^^ + g) a (6.39) 
According to the completing-the-square theorem (Supplément Section 2), 

exp (- i aa 2 + /?a) « JV(a; a' 1 fi, a -1 ) (6.40) 

we thus have 
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q« + Ha)«N(a;v2 +1 \o? +X >) (6.41) 

where 

°f +1) ■■= n-- 2 ( X U) q « M + (6-42) 

and 

^ ■= (af x b,f ZLA.-A Vo (Xt _ 1:t) + g) (6.43) 

□ 

o Dérivation of an update équation for q(A x ) 

From (6.32), using the expectation partitioning as above, reordering in terms of \nA x and A x and setting j = i + 1 
and n = i we have 

lnq (i+1) (AJ oc g + % - l)l„4 - g(2L 2 <A:t 2 >,a) w - 2Zr= 2 <« t -i* t >,^ (a) , ( o 0ct . 1:t) +Zr= 2 <a 2 *?_ 1 ), (1+IJ(fl) , ( o Clt . 1:t) ) + (6.44) 

Taking the exponential on both sides then yields 

?< l+1 >C^)«4h* _l) «p(-(£ + K^^ (6.45) 

Using 

^ aX t-l X t>qa+i) (a) q(0 (Xt _ 1:t ) = ( a >qa+D(a)( x t-l x t>q(0( Xt _ 1:t ) = i"a +1) ( X t-l X t>q«( Xt _ 1:t ) (6-46) 

and 

évaluation of the remaining expectations in terms of the parameters of q^ l+1 \a) then yields 

q^Hh) oc if ^ exp (- (jL + i(S[ =2 <* t 2 Vo (X£) - 2 E[ =2 f4' +1) <*t-i*t W l!t) + YJt-2 + nT^) <*?-i Wj))*,) 

(6.48) 

Up to a normalization constant, is thus given by a Gamma distribution 

q^\A x ) « A"^- 1 exp (-^fïï) (6.49) 

with 

a l +1):=T 2 + a * X ( 6 - 5 °) 

and 

^ == (^ + Kï«^ 2 Wd + + +^ +1)2 )Zï:2(x?-l>,(0 0tM) )) (6-51) 

□ 
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o Dérivation of an update équation for q(b) 
Setting m = n = iwe have 

lnqOO a -^(ELi* 2 - 2bY l T t=1 y t x t + b 2 Z[=iX t 2 )> q( o (Ay) ,(0 (Xt _ 1:t) - ^(> 2 - 2bn b + ni) (6.52) 
Partitioning the expectations, we obtain 

\nq(b) oc -i a y > q(0(Ay) ZF=i y t y t - 2b Z T t=iyttt y x t ) q (i)( Ay)q (i) (Xt) + ^ 2 S[=iU y ^ 2 > î (0 (Ay ) î (0 (;C£) -^i(è 2 - 2bn„+nl) ( 6 -53) 
Rewriting the above as a quadratic form in b and ignoring terms independent of b yields 

Using 

<Vt ) q (0 (Ay)q (0 fe) = Uy> £? (o (Ay) (x t 2 ) £? (o (%t) = ag^fo 2 ) q m ixù (6.55) 

and 

<^y^t>q(0( Ay )q(0 (Xt) = (*y) q V( h )(Xt) q m ixù = a ? y h y l) (x t ) q(i)(Xt) (6.56) 
évaluation of the remaining expectations in terms of the parameters of q^(A.y) then yields 

InqOO a " I[ =1 <x 2 ) q( o fe) + ^) b 2 + (a®l*«> SF-itt^),»^ + g) (6.57) 
According to the completing-the-square theorem, 

exp [~\ab 2 + pb) « N(b; a -1 /?, (6-58) 

we thus have 

q^\b) oc N(b;^ +1 \ a (6.59) 

where 

af +1) := (a^b,f ZÏ =1 ( X ï) q U iXt) + ^ (6-60) 

and 

■= *f +1) (agbtf JS-iyM,,^ + g) (6-61) 

n 

o Dérivation of an update équation for q(Ay) 

From (6.32), using the expectation partitioning as above, re-ordering in terms of \r\A y and A y and setting l = i + 1 
and n = i we have 

lnq( i+1 )(l y ) oc g + a Ày - l)lnA y - (i(S=iy t 2 - 2 E[ =1 y t <te t >,a +1 ) (6)î (o (X£) + ELi<è 2 % t 2 >,« +1 ) (fc)? (o (X£) ) + ^)l y (6.62) 
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Taking the exponential on both sides then yields 

q^\A y ) oc if^- 1 ) exp (- (i(SF =1 y t 2 - 2 r t=1 y t (bx t ) q(i+1)(b)qU)(Xt) + lU(b 2 xf) q a +lKb)q w Cxt) ) + ±) A y ) (6.63) 

Using 

(bx t ) q a + i) (b)q (i) (Xt) = (b) q (i + i) (b) (x t ) q m (Xt) = rf +1) < x t>q« (Xt) (6-64) 

and 

(b 2 xï ) q a+i WHxt) = (b 2 ) q a+D W {xï) q ( Hxt) = {of +1) + rf +1)2 ) (xf) q w (Xt) (6.65) 
évaluation of the remaining expectations in terms of the parameters of q^ l \b) then yields 

a exp (- (i (z^* 2 - 2^ +1) ZÏLiy t <*Vo w + (^ 2 " +1) + 4 i+1)2 ) ZLi^W,,) + ^) A y ) 

(6.66) 

Up to a normalization constant, q^ +1 ^(/l y ) is thus given by a Gamma distribution 

q^\X y ) « A< +1) - X exp (- ^ (6.67) 

with 

a C 1):= ^ + % (6 - 58) 

and 

== (^ + l(^=^ - 2^ +1) ZL 1 y t (x t > £? (0 (Xt) + +rf +1)2 )ZL 1 (x t 2 ) £?( o (;Ct) )) (6.69) 

n 

In summary, we have obtained as set of itérative update équations for the variational distribution parameters which 
can be initialized using the parameters of the marginal distributions (6.4) - (6.7) and is shown in pseudocode form in 
Table 4. Notably, most update équations involve sums over expectations of the unobserved variables x 1:T under the 
variational distribution q^ l \x 1:T ). The dérivation of thèse expectations in parameterized form is the topic of Section 
6.3.1.5 of this supplementary material . 
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% initialization of the variational distribution parameters to the parameters of the prior distributions 

a 2(0) := a 2 

(o) ._ 

Ma — Ma 

(o) 

a := ai 

bf :=b À 

rr 2(0) - n 2 
(o) 

M& : = Mb 

<> =% 



% itérative update équations for the variational distribution parameters 
for i = 0,1,2 ... 



( +1) T 
l, := - + a; 



Table 4. Variational parameter update équations for the unobserved variables a, ft, A* and 
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6.3 Evaluation of qr (i+1) (*i : :r) 

We first validate the relationship between the proportionality statements of équations (28) and (29) of the tutorial: 
q (i+1) (x 1:T ) a exp(<lnp(y 1:r< x 1:r< 0)) fl (i + i )(fl) ) ^ q a+1) (x 1:T ) « exp (<lnp(x 1:T |y 1:T , 0)> ? a+D (e) ) (6.70) 
o Dérivation of the équivalence (6.70) 
We hâve 

<7 (l+1) (*i:r) k exp((lnp(y 1:T ,x 1:T ,6i)> q a +1 ) (0) ) 

= exp ((\n(p(y 1:T ,x 1:T \6)p(6))) q a + i) m ) 

= exp ((lnp(y 1:T ,x 1:T |6») + lnp(6>)) q (i +1)(0) ) 

= exp ((lnp(y 1:T ,x 1:T |6»)> q (i +1 ) (0) + (lnp(ô)) (? (i + i) (0) ) 

= exp ((ln p(y 1:T ,x 1:T | 6)) q(i+ exp ((lnp(0))> £?a+ i) (0) ) 

(6.71) 

The latter term in the product on the right-hand side of (6.71) is independent of x 1:T and may thus be regarded as a 
(further) positive multiplicative constant of the probability density governing the x 1:T . We thus have 

<Z a+1) Oi:r) « exp ({\np(_y 1:T , Xl:T \e)) qC i + i) m ) (6.72) 

We further have 

<Z ( ' +1) Oi:r) k exp ((lnp(y 1: ^x 1:T |6O>q(i+i) (0) ) 

= exp ((ln(p(x 1:T |y 1:T , 0)p(y 1:T |6i))> q a +1 ) (0) ) 

= exp ((lnp(x 1:T |y 1:T ,6i) + lnp(y 1:T |6»)> q (i +1 ) (0) ) 

= exp ((lnpCx^lyi^e))^!)^ + (\r\p(y 1]T \9)) q(i+1)(e ^ 

= exp ((lnp(x 1:T |y 1:T ,0)> £?a+1)(0) )exp((lnp(y 1:T |0)> £?a+1 ) (0) ) 

(6.72a) 

Again, the latter term in the product on the right-hand side of (6.73) is independent of x 1:T and may thus be 
regarded as yet another positive multiplicative constant of the probability density governing the x 1:T . We thus have 

9 a+1) (*i:r) K exp^lnpCx^lyi^e))^!)^) (6.73) 

□ 

As discussed in Section 4.2.2, inference of exp((lnp(x 1:T |y 1:T , S)) q (i+i)^ g ^j can be achieved by capitalizing on classic 
state space model inference algorithms. As in the tutorial, we thus first review inference of p g (x 1:T \y 1]T ), 
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i.e. the évaluation of of p(x 1:T \y 1:T ) for known, non-random values of 8 ■= (a, a Xl b, C7 y ) . To this end, we first verify 
équation (31) of the tutorial, i.e. the re-parameterization of p(x 1:T |y 1:T ) in terms of a product the ratio between its 
pairwise conditional marginal distributions p(x t _ 1 ,x t \y 1 . T ') and its conditional marginal distributions p(x t _ 1 |y 1:T ): 

P(*l:r|yi:r) = P(Xl\yi-.Tmï=2 Pi 2 1 T 1 ? (6 " 74) 

Pl x t-l\yi:T) 

o Dérivation of the équivalence (6.74) 

From the définition of conditional probability, we have 

P(.x 1:T \y 1:T ) = p(x T |x 1:r _ 1 ,y 1:r )p(x r _ 1 |x 1:T _ 2 ,y 1:r ) -p(x 2 \x 1 ,y 1:T ) p(x x | y 1:T ) 

= P(^i|yi:r)nr=2P(^ki:t-i<yi:r) (6.75) 

From the defining properties of the LGSSM, we have that given x t -±, the random variable x t is conditionally 
independent of ail other latent variables x 1:t _ 2 and x t+1 . T (t = 3, ..,T — 1) . We may thus discard the variables 
x i-.t-2 f rom eacn term ]n tne product of the right-hand side of équation (6.75). Likewise, the random variable x x was 
assumed to be conditionally independent of x 2:T (or in other words, defined by its marginal distribution). We thus 
have 

P0i:r|yi:r) = P(*i|yi:r)n[=2P(>tl*t-i<yi:r) = POi \yi-.r) 111=2 g Srn^TT (6 " 76) 

Pl x t-l\yi:T) 

□ 

Next, we provide a gênerai review of the Kalman-Rauch-Tung Striebel Algorithm. Because, in contrast to the scénario 
discussed in the tutorial, standard treatments of LGSSMs consider multivariate latent and observed variables, the 
following treatment uses the multivariate form of the LGSSM. 

6.3.1 The Kalman-Rauch-Tung-Striebel Smoothing Algorithm 

For LGSSMs, the évaluation of conditional distributions over the variables x 1:T given an observed data 
séquence y 1:T and a known parameter 6 corresponds to the "state space model inference" problem for which a 
variety of solutions exist in the literature (for a comprehensive review see e.g. (Briers, Doucet, & Maskell, 2010)). The 
most popular solution for inferring the distributions of x 1:T is perhaps the KRTS smoother, which represents the 
combination of the classic Kalman Filter for state space models (Kalman, 1960) with a backward algorithm (Johari, 
Desabrais, Rauch, Striebel, & Tung, 1965). In brief, the KRTS smoother is a recursive algorithm that yields the 
expectation and covariance parameters of the conditional unobserved variable distributions in matrix product form. 

In contrast to the tutorial, we here consider the gênerai form of a "multivariate" LGSSM. To this end, we 
dénote the dimensionality of the unobserved variables by k £ M and the dimensionality of the observed variable by 
p EN and rewrite expression (5) of the tutorial as 

x t = Axt-t + E t (x t £ R k , A £ R kxk , e t ~ N(e t ) 0, 0 £ R k , I x £ R kxk , I x > 0) (6.77) 

y t = Bx t + T] t (y t £ R P ,B £ R pxk , ï] t ~ N(j] t ; 0, S y ), 0 £ R p , I y £ R pxp , I y > 0) 

for t = 2, ...,T and t = 1, ...,T, respectively, and with the assumption of independent dynamics and observation 
noise, that is, C(e t , rj t ) = 0 £ R kxp . It is central for the understanding of LGSSMs that thèse two équations specify a 
multivariate joint Gaussian distribution over the unobserved variables x^, ...,Xj and the observed variables 
y 1( ...,y r . Specifically, due to the Gaussian properties of s t and r] t , the équations of (6.77) specify the following 
conditional probability distributions over unobserved and observed variables x t and y t , respectively 
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p(* t |*t_i) = N(x t ;Ax t - 1 ,ï x ),t = 2 Tandp(y t \x t )=N(y t ;Bx t ,2y),t = 1 T (6.78) 

The spécification of the joint distribution for the LGSSM is complétée! by a further Gaussian distribution over x x with 
expectation parameter \i x £ R k and covariance S x £ R kxk 

p(x 1 ) = N(x 1 ; Hj_.lt) (6.79) 

Usingthe "colon notation" abbreviations 

x 1:T : = (x lt ...,x T ) andy 1:T : = (y x , ...,y T ) (6.80) 

and parameter set définition 6 ■= {pL-^, 2 1( A, S z , S, 2 y } then allows for writing the joint distribution over unobserved 
and observed variables as 

Pe(*i : r.yi:r) = Pmi,Zi( x i) ÏÏt=2 PA£ x { x t\ x t-i)ÏÏt=iPB? y (jt\xt) ( 6 - 81 ) 

As discussed in the tutorial, expression (6.81) states that the joint distribution over ail variables x 1:T and y 1:T of the 
LGSSM is given by the product of conditional Gaussian distributions over x 1:T and ,y 1:T , respectively. Because the 
product of Gaussian probability densities is again a Gaussian probability density, (6.81) hence spécifies a Gaussian 
joint distribution by means of its factorization properties. 

The Kalman-Rauch-Tung-Striebel (KRTS) smoothing algorithm, as sketched herein, is an algorithm that, given 
a parameterized 2T-dimensional Gaussian joint probability p(x 1:T ,y 1:T ) over random variables x 1:T and ,y 1:T with 
spécifie factorization properties and parameterized by the parameter vector 8, returns the parameters of the T- 
dimensional conditional distribution p(x 1:T \y 1:T ). However, as outlined in the tutorial, the parameters of 
p(x 1:T \y 1:T ) are returned not in the form of a T-dimensional expectation parameter H Xl . T \ yi . T and TxT- 
dimensional covariance parameter Z Xl . T \ yi . T , but as the parameters of the conditional "marginal" distributions 
Vei x t\yi-.T) f° r t = 1, ...,T and the conditional "joint" distributions p(x t _i,x t |yi : r)- Because, as shown above, by 
the conditional independence properties of the LGSSM, we have 

p(x 1:T \y 1:T ) = pix 1 \y 1]T mJ=2 Pi 2~ 1 T 1 'f ( 6 - 82 ) 

Pl x t-l\yi:T) 

thèse two parameterizations are équivalent. Following the classic treatment of state space model inference, in this 
Section, we first review the Kalman filter, a recursive algorithm to obtain the parameters of the distributions 
p(x t \y vt ) in Sections 6.3.1.1 and 6.3.1.2. We then turn to the KRTS smoothing algorithm in Section 6.3.1.3. Based on 
the results of the Kalman filter algorithm, the KRTS smoothing algorithm allows for the dérivation of the parameters 
of the distributions p(x t |y 1:T ) in the denominator of the right-hand side of (6.83). Finally, in Section 6.3.1.4, we 
review the recursive évaluation of the parameters p(x t _ 1 ,x t |y 1:T )required in the numerator of the right-hand side 
of (6.83) and close this section with a summary in 6.3.1.5. 

6.3.1.1 Inference of p(jc t |y 1:t ) 

Inferring a probability distribution over the unobserved variable x t given the values of ail observed variables 
up to y t , Le., p(x t \y vt ) is performed readily due to the analytical properties of Gaussians distributions. As the joint 
distribution over x 1:T and y 1:T is Gaussian, ail marginal and conditional marginal distributions are also Gaussian 
distributions. Further, as Gaussian distributions are characterized by their first two central moments, Le., their 
expectation and covariance parameters, it is only thèse parameters that have to be inferred to fully characterize the 
filter distribution of interest 

p(.x t \y 1:t ) = N(x t ;n Xtlyi . t ,-Z Xtlyi:t ) (t = 1 T) (6.84) 
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The aim of the following section is to find a recursive expression for the conditional expectation and covariance 
parameters Mx t |y l t e ^ k ar, d ^x t |y l t e IR kxk of (6.84) in terms of the parameters of the "preceding distribution" 

POt-llyi:t-l) = ^( x t-i;fte t _ 1 |y 1:t _ 1 ^* t _ 1 |y l!t _ 1 ) ( f = 2 < -> T ) ( 6 - 85 ) 

Formally, this can be achieved in two steps (in the dérivation of the Kalman-filter équations, we follow (David Barber, 
2012)): first, by finding the parameters of 

P(*t,ytlyi:t-i) = N (Cy t y^x t ,y t \y 1 . t ^x t ^\y 1 .. t ^ (6-86) 

in terms of the parameters of the distribution pOct-i<yt-i|yi:t-i)> anc ' second, by conditioning p(x t ,y t |y 1:t _ 1 ) on 
y t , yielding the parameters of the distribution of interest. The first step can be achieved by considering the 
partitioning of the expectation and covariance parameters of p(xt>yt\yi:t-ù> which are given by 

V* t ,y t \y, t -, ~ U yt |y 1:t _ J ~ { E(y\y^) ) R (6 " 87) 

and 

_ / 2 x t ,x t |yi:t-i ^xt.yt\yi-.t-i\ _ (^(. x t> x t\yi:t-i) ^i x t>yt\yi-.t-i)\ f- H5(fc+p)x(fe+p) /f-oo\ 

^ Xt ,y t \y, t -, ~ Uy^iy^, h t , yt]yi ,J " U(y t , x t \y 1 , t _ 1 ) Ciy^yAy^)) K (6 - 88) 

Recursive expressions of (6.87) and (6.88) in terms of the parameters of p(x t _ 1 ,y t _ 1 |y 1:t _ 2 ) can be obtained by 
means of the linear transformation theorem for Gaussian distributions, yielding (see below for détails) 

and 

/ AI X , v A T + I X B T (AI X , v A t + ï, x ) \ 

W^-dy^-^ + 2 *) B^Vil W* T + I X )B T + lyj 
o Dérivation of équations (6.89) and (6.90) 



The first component of f i x t ,y t \y 1 . t - 1 represents the expectation of the random vector x t conditioned on the 
observations y^x-i- Based on the dynamics équation of the LGSSM it is thus obtained by a linear transformation of 
the expectation of the random vector x t _ x conditioned on the observations y 1]t -i under the transition matrix A and 
additive zero-mean noise e t , in other words 

^ t |yi:t-i = E ( x tlyi:t-i) = EG4x t -i + £ t lyi:t-i) = EG4xt-ilyi:t-i) = AEix^y^^) = Afi Xtilyi . ti (6.91) 

Likewise, the expectation of the observation y t conditioned on the observations y 1]t -i is given by the expectation of 
the linear transformation of x t under the émission matrix B and additive zero-mean noise r\ t ; and further, x t is itself 
given by the linear transformation of x t _ 1 under A, andd additive zero-mean noise s t in other words 

i"y £ |yi :t -i = E (ytlyi:t-i) = E(Bx t + Vt\yi-.t-i) = E(5(i4x t _! + £ t )lyi :t -i) = BE(Ax t . 1 + £ t |yi :t _i) = 5i4E(x t _ 1 |y 1:t _ 1 ) = BA[i Xt ^ _ t 

(6.92) 

Concaténation then yields the desired recursive expression for the expectation parameter of p(x t ,y t |y 1:t _!) 
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^,y t \y, t -, ~ {bZ^I) (6 ' 93) 



The first component of S z y \ y represents the covariance of the random vector x t with itself conditioned on 
yi:t-i, ( ^{x t ,x t \y 1 . t _ 1 ). As x t is given as the linear transformation of x t _ x under additive zéro mean noise s t with 
covariance matrix l x , the Gaussian linear transformation theorem (conditioned on yi :t _i) applies. The expression 
for the covariance of x t with x t in terms of the covariance of x t _ x with x t _ x is thus given by 

C(* t ,*t|yi :t -i) = AZ Xt _ llyi . t _ 1 A T + Z x (6.94) 

Evaluating the covariance of y t with y t in terms of the covariance of y t _ x with y t _ x yields 

c(y t , y t |yi :t -i) = <c(s* t + rç t , Bx t + rç t |y 1:t _ x ) (6.95) 

= CiBxt.Bxtly^J + Xy 

= BC(x t ,x t \y 1:t _ 1 )B T + l y 

= S (^ t _ 1 | yi:t _ 1 ^ + ^ + I y 
Finally, évaluation of the covariance of x t with y t in terms of the covariance of x t _ x with y t _ x yields 

<C(x t ,y t |yi:t-i) = C(^t-i + e t ,B(i4x t _i + Et) + VtlVi-.t-i) = {^ Xt _ llyi . t _ 1 A T + Z X )B T (6.96) 
and hence also 

c(y t ,*tlyi:t-i) = (<c(^<y t |yi:t-i)) T = B(Ai Xt _ ilyi:t _ i À r + z x f (6.97) 

Concaténation then yields the desired recursive expression for the covariance parameter of p(x t ,y t |y 1:t _ 1 ): 

/ Al Xf , V1 f A T + Iv (Al Xf , V1 f A T + I X )B T \ 

□ 

6.3.1.2 The Kalman filter équations 

The second step outlined above can be achieved by capitalizing on the Gaussian conditioning theorem (2.17), 
resulting in 

ftctlyi* = A ^ t -,\y, t -, + O^t-dy^-^ + *x)B T (B(AZ Xt _ l]yit _ 1 A T + Z X )B T + I y ) _1 (y t - BAfi^y^) 

(6.99) 

and 

(6.100) 

The expressions (6.99) and (6.100) for H Xt \ yi . t anc l ^% t |yi t a ' 30ve can formulated more succinctly by setting 

P t == C(x t ,x t \y 1:t _ 1 ) = AI Xt _ i]yi . t _A T + Z x e R kxk (6.101) 
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yielding 



and 



*x t \y 1:t =Pt~ P t B T {BP t B T + I y ) _1 SP t 
Finally, by defining the so-called Kalman Gain Matrix 

K t ■■= ^y, t -ïyly t \y, t -, = P t B T (BP t B T + I,)" 1 6 R**p 
the parameters may equivalently be expressed as 

and 

^ Xtlyi , = P t -K t BP t = (I-K t B)P t 



(6.102) 



(6.103) 



(6.104) 



(6.105) 



(6.106) 



To avoid that the inferred covariance 2 X | yi , which is given by the différence of two positive-definite 
matrices in (6.35) and, hence, is itself positive-definite, becomes non-positive-definite due to numerical rounding 
errors, Bucy and Joseph (Bucy & Joseph, 1987) proposed the following modified update rule, which includes the 
formation of a positive-definite matrix on every inference step, which is derived below: 



2 *tlyi:t = 0- K t B)P t = (/ - K t B)P t {l - K t B) T + K t Z y K? 



(6.107) 



o Dérivation of équation (6.107) 

Omitting subscripts to simplify the notation, we have 

(/ - KB)P(J - KB) T + KI y K T = (/ — KB)P(I - B T K T ) + KY. y K T 

I - KB)P - (/ - KB)PB T K T + KI y K T 
I - KB)P - PB T K T + KBPB T K T + Kl y K T 
I - KB)P - PB T K T + K{BPB T + ï. y )K T 

I - KB)P - PB T K T + PB T {BPB T + I y ) _1 (SPB T + ï, y )K T 
I - KB)P - PB T K T + PB T K T 
I - KB)P 

(6.108) 

□ 

To initialize the recursion of équations (6.105) and (6.106), we evaluate the conditional distribution p(xilyi) based 
on the parameters of the marginalization distribution p(x x ) and the conditional distribution p(yi|xi). We have 



H Xl \ yi ■■= H Xl + Z Xl B T (BÏ Xi B T + l y y\ yi - Bn Xi ) 



and 



(6.109) 



(6.110) 



pressions (6.109) and (6.110) and équation: 
replacing A/x Xt _ My ^ t i with pi Xi and (AZ Xt _ 1 \ yit _ 1 A T + I. X ) with S Xi in (6.109) and (6.110) results in équations 



Note the close relationship between expressions (6.109) and (6.110) and équations (6.105) and (6.106): merely 
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(6.105) and (6.106). Based on the dérivation of équations (6.109) and (6.110) below, we may thus interpret the 
Kalman recursion équations as a generalization of Bayesian inference in "static" linear Gaussian models. 



o Dérivation of équations (6.109) and (6.110) 

By définition of the LGSSM, we have 

p(xj) = N( Xl ;fi Xi ,I Xi ) andpOil*!) = N(y 1 ;Bx 1 ,2 y ) (6.111) 
The expresssions above specify a standard (static) Bayesian linear Gaussian model p(x 1( y-J of the form 

where 

(?) ■ inn 1 ) e ^ p+k and ( ^ ) e r(p+»*(p+V (6.113) 

KyJ'yBuxJ \BZ Xl BI Xi B T + ï,yJ v ' 

(see e.g. Bishop (2006), pp. 78 - 102). From the "marginalization and conditioning theorem" of Supplément Section 
2, we thus have 

ftcdyi = + *x 1 B T (B'Z Xl B T + I y ) _1 (y! - Bfi Xi ) (6.114) 

and 

2 *ilyi = S *i ~~ Z Xi B T (BÏ. Xi B T + Z y ) (6.115) 



Based on the initial équations (6.109) and (6.110), the recursive expressions (6.105) and (6.106) and the values of 6 
and y 1:Tl we may thus infer the conditional distributions p(x t |y 1:T ) over the unobserved variables x 1:T for 
t = 2, ...,T. However, as discussed in the tutorial in an offline inference scénario we may also infer the conditional 
distributions p(x t |y 1:T ). The parametric forms of thèse distributions can be obtained by combining the Kalman filter 
recursion with a backward recursion starting from p(x T \y T ) as discussed next. 

6.3.1.3 Kalman-Rauch-Tung-Striebel Smoothing- Inference of p(x t |y l r ) 

Like the conditional marginal distribution p(x t |y 1:t ) the conditional marginal p(x t |y 1:T ) of an LGSSM is a 
normal distribution and, hence, can be characterized by its first two central moments li Xt \ yi . T and ^ Xt \y VT - ln 
analogy with (6.84) and (6.85), the aim of the following section is to dérive the parameters of 

p(x t \y 1:T ) = N(x t ;n Xtlyi . T ,-Z Xtlyi:T ) (t = 1 7) (6.111) 

in terms of the parameters of the "succeeding distribution" 

POt+i lyi:r) = N(x t+1 ; Hx t+1 \y 1:T > ^x t+1 \y 1:T ) (t = 1 T - 1) (6.112) 

A backward recursion starting from the final resuit of the Kalman Filter recursion p(x T |y 1:T ) may be derived by 
noting that for the LGSSM the joint distribution over temporally adjacent latent variables conditioned on y 1:T may be 
written as (see below for détails) 

„f v v U, A _ Pfl(*t+ll*t)Pfl(*tlyi:t)Pfl(*t+llyi:r) tF.-\-\J\ 

P<,x t ,x t+1 \y 1:T ) - Pe( x t+1 |y 1:t ) (6 - 113) 
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In the dérivation of the KRTS update équations, we follow (Yu, Shenoy, & Sahani, 2004). 



o Dérivation of équation (6.113) 

We first note that 



Pix t ,x t+1 \y 1:T ) = p(x t |x t+1 ,y 1:T )p(x t+1 |y 1:7 0 (6.114) 



We next note that x t is independent of y t+1 given x t+1 , i.e. given x t+1 , x t and y t+1 are conditionally independent, 
and one may write 

P(*tl*t+i<yi:r) = V(.Xt\xt+i,yi-.t) (6.115) 

The conditional independence of x t and y t+1]T given x t+1 may be derived by using the d-separation criterion in the 
graphical model framework (p. 378 in (Bishop, 2007)): because every path from x t to y t+1 . T is blocked by the node 
x t+1 , i.e. the arrows on the respective paths meet head-to-tail at the node x t+1 (see Figure 3A of the tutorial), x t is 
conditional independent of y t+1 - T given x t+1 . In other words 

x t 1 y k+1 \x t+1 for k = t,t+ 1,...,T - 1 (6.116) 

We hence have 

p(x t ,x t+1 |y 1:T ) = p(x t |x t+1 ,y 1:t )p(x t+1 |y 1:T ) (6.117) 



Considering the first term on the right-hand side of the above, we have, again with the factorization properties of 
the LGSSM 

„r v \ v „ _ P(x t ,X t+1 \y 1 :t) _ P(Xt\yi:t)P(Xt+l\x t ,yi:t) _ POtlyi:t)POt+ll*t) /C11Q\ 

p{x t \x t+1 ,y 1:t ) - p(Xt+ilyit) - p(Xt+llyi . t) ~ p( x t+1 |y l!t ) l5 ■ 118, 
where the conditional independence of x t+1 and y 1:t given x t follows from 

p{.x t+ i,yi:t\x t ) - ^ - ^ - ^ - P(x t+1 \x t )p(y 1:t \x t ) 

(6.119) 

In the penultimate equality of (6.119), the conditional independence of y 1:t and x t+1 given x t can again be derived 
based on the d-separation criterion: ail paths from x t+1 to y 1:t are blocked by the node x t , or in other words, the 
respective paths meet tail-to-tail at the node x t 



Taking logarithms and substituting for the distributions on the right hand side of (6.113) then results in 



lnp(x t ,x t+1 |y 1:T ) = --(x t+1 -^x t ) T S x 1 (^t+i - Ax t ) 

~ j i x t ~ Mx t |y 1:t ) (j-x t \y 1:t ) (.*t ~ Mx t |y 1:t ) 



2 (x t+1 Hx t+1 \y 1:T ) (j-x t+1 \y 1:T ) (.X t +i ^x t+1 \ yi . T ) 
+ 2 ( x t+l ~ ^x t+1 \y vx ) ( S x t+1 |y 1:t ) (x t+1 - Mx t+1 |y 1:t ) + C 
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(6.120) 



where C £ R dénotes a a constant independent of x t and x t+1 . Rewriting the above in terms of first and second 
order in x t and x t+1 and comparing thèse to a standard Gaussian distribution partition allows for inferring the 
entries of the conditional covariance matrix ^ Xt ^ t+± \y ± . T - This conditional covariance matrix partitions according to 



2 -( ^ x ^y^T ^x t ,x t+1 \y 1:T \ _ ( £(x t , X t \y 1:T ) <£(x t ,X t+1 \y 1:T ) \ ^ 2 kx2k 

x t ,x t+1 \ yi .. T \Z Xt+iiXt]yi:T ï Xt+1 | yi!r ) \£(x t ,x t+1 \y 1:T ) C(x t+1 ,x t+1 \y 1:T )J 



(6.121) 



The diagonal entries of '%x t jc t+1 \y 1 . T in turn allow for inferring the following recursive KRTS update équations (see 
below for détails) 

ftetlyiT = Vx t \y 1]t + Z Xt \y llt A T (AL Xt \ yi:t A T + I x ) (^x t+1 \y 1]T ~ A ^x t \y vx ) (6.122) 

and 

(6.123) 

Notably, thèse recursive backward expressions for the smoothed expectation l* Xt \ yi . T ar, d covariance parameters 
^•x t \y VT contain only the filtered conditional expectation H Xt \ y± . t , the filtered conditional covariance ^ Xt \y vt > tne 
transition matrix A, and the state noise covariance S z . The ensuing algorithm is referred to as KRTS smoother. 

o Dérivations of équations (6.122) and (6.123) 

As outlined above, we dérive the KRTS update équations by means of the conditional covariance matrix 
Z XtAt+1 | yi 7 , following the approach of (Yu, Shenoy, & Sahani, 2004). This approach capitalizes on the following two 
matrix identifies 

{A + UCVy 1 = A' 1 - A^UÇC- 1 + VA~ 1 U)~ 1 VA~ 1 (6.124) 

and 

(/ - (A + B)- 1 A)B~ 1 = (A + fi)" 1 (6.125) 

for appropriately specified matrices A,B,C,U and V. (6.124) is known as the "Woodbury identity" or "Matrix 
Inversion Lemma". (6.125) is verified as 

(A + BY 1 (A + B) = I (6.126) 
<^> 04 + B) _ M + 04 + By^-B = I 

<=> (4 + fi) _1 B = I-(A + fi) _ M 
^ (A + fi) -1 = (/ - (A + B)~ 1 A)B~ 1 
<=>(/- (i4 + By 1 A)B~ 1 = {A + By 1 
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Further, the approach of (Yu, Shenoy, & Sahani, 2004) uses the formulation of the quadratic form in the exponent of 
multivariate Gaussian distributions as introduced in Supplément Section 2 and written here for 



p(z 1 ,z 2 ) = iv(g 2 1 );(^),A-) (6.127) 



as 



i ( z 1 - fi^T Mu A 12 \ (Z 1 ~ Mi\ , _ 



= - \ z l A n z i ~ \z\A X2 z 2 - \zlh 21 z r - \z\A 22 z 2 + zJ(A 21j u 1 + A 22 fi 2 ) + z[ (Au^ + A 12j u 2 ) + C 

(6.128) 

with suitably chosen vectors z 1 ,z 2 ,fi 1 ,fi 2 and block matrix A, denoting the inverse of the covariance matrix of 
p(z 1( z 2 ) and C £ M. In contrast to Supplément Section 2 we here use numerical indices to avoid confusion with the 
nomenclature for latent and observed variables in the LGSSM context. In addition (Yu, Shenoy, & Sahani, 2004) note 
that for the inversion of block matrices, the following equalities hold. Let 

^11 ^12\ . f A n A 12 \ 



and 



(^21 ^22) (-^21 ^22) ^ ^ 



Tu == A X1 - A 12 A 2 ^A 21 and T 22 := A 22 - A^A^A^ (6.130) 



Then 



(An A i2\~ 1 = ( r îi -rr^A^AïA 

U21 a 22 ; {-a-aa^ 1 V£ ) 



A r \ + A r \ A 12 r 22 1 A 21 A 1 i T 11 1 A 12 A 22 

— A 22 A 21 T 11 A 22 + A 22 A 21 r xl A 12 A 22 



(6.131) 



which may be shown using standard matrix algebra. 

Based on the set of preliminaries (6.124) - (6.131), équations (6.122) and (6.123) may now be derived from 
(6.120) as follows. As a first step, the right hand side of (6.120) is reordered in terms of second-order terms in x t+1 , 
first-order mixed terms in x t+1 and x t , second-oder terms in x t and first-oder terms in x t . This reordering results in 

lnp(x t ,x t+1 |y 1:T ) = -ix t T +1 (2- 1 -(l Xt+i | yi:t ) 1 + % t+1 \y 1:T ) )*t+i (6-132) 
— 2 x t+i ( — 1 À)x t — - xj (— A T Ï, X 1 )x t+1 
-\xj{A T ^A + % t \ yi J 1 )x t 

+xj ((X t |y 1:t ) ftc t |y l!t ) + C 
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Term-by-term comparison of (6.132) and (6.128) then allows for inferring the entries of the inverse covariance 
matrix ofp(x t+1 ,x t \y 1:T ) to be given as 



p(x t+1 ,x t \y 1:T ) = 



(6.133) 

To obtain the covariance matrix of p(x t+1 ,x t |y 1 . r ), the block matrix on the right-hand side of (6.133) has to be 
inverted. To achieve this, and simultaneously dérive (6.123) (Yu, Shenoy, & Sahani, 2004), first simplify A22 and 
A22A21 in (6.131) in terms of the entries in (6.133). Specifically, using the matrix inversion lemma and (2.10) 

A 2 2 = (Â r Z x 1 A + 2 Xt ] yi;T ) = 2 Xt |y 1;t - ^x t \y 1]t A T (ï x + A ^ Xt \y VT A T ) AT. Xt]yi t 

= 1 x t \y 1 .. t -Zx t \y 1:t A T (Xx t+1 \y 1 .. t ) ^x t |y 1:t (6.134) 

and defining 

h ■■= ïx^Çx^yJ' 1 (6-135) 

we obtain 

A 22 = S x t |y 1:t -Jt^x t+1 \ yi Jt (6.136) 

Likewise, by means of the matrix identity 



we obtain 



(/- (A + B)- 1 A)B- 1 = (A + B)- 1 (6.137) 

Ai!A 21 = -(S Xt|yi:t -J t l Xt+i]yi J)A T ^ (6.138) 

= ~^x t \y 1:t A T (I - (2 X + AÏ. Xt \y i]t A T ) AI. Xt \y vt A T )lL x 1 

= ~^x t \y 1:t A T (j-x + AZx t \ yi . t A T ) 
= ~Jt 

After this simplification, the block-matrix on the right-hand side of (6.133) may be inverted to obtain the entries of 
the covariance matrix of p(x t+1 ,x t |y 1:T ) according to (6.131). Specifically, we have 

*x t+1 \y 1:T = rii 1 (6.139) 

and therefore 

ïx t \y 1:T ■•= Ai2 + Ai2 1 A2irfi 1 A 12 Ai 2 1 (6.140) 
= ( S ^tlyi:t - Jt^x t+1 \ yi .Jt) + (-A) 2 x t+1 |y 1:T (-/t T ) 
= ^x t \ yi . t + Jt(^x t+1 \y 1]T ~ ^x t+1 \ yi . t )jt 
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= 2 *tlyi:t + ( 2 ^tlyi:t^ r ( 2 ^t+ilyi:t) ) ( 2 *t+ilyi : r ~~ ^t+ilynt) ( 2 ^tlyi : t^ r ( 2 ^t+ilyi:t) ) 



Noting that with the Gaussian linear transformation theorem 



and thus 



2 *tlyi:t-i ~~ ^^t-ilynt-i^ 7, + 2 * 



^t+ilyi:t ~~ ^^tlynt^ 7 + ^ 



(6.141) 



(6.142) 



we thus have (6.123), i.e. a backward recursive expression for Z Xt \ yi . T i n terms of the filtered covariance matrices 
I Xt | yi t and the parameters of the LGSSM 



X t +l\yi:T *t|yi:t ^(^ £ |y 1:£ ^ + ^)" 1 ) 



To find the expectation parameter, we compare the last terms in (6.128) and (6.132), i.e. 



(6.143) 



4(A 2 iMi + A 22 /i 2 ) and x[ (X t |y 1:t fe t |y 1:t ) 



(6.144) 



Apparently, we have 



(6.145) 



We hence infer 



A 21l"x t+1 |y 1:T + A 22j"x t |y 1:T - ^y^t^tlyi 



(6.146) 



Solving for Mx t |y 17 - we obtain 



Vx t \ yi . T ~ A 22 A 21j"x t+1 |y 1:7 . + A 22 2 x t |y 1;t #ic t |y 1:( 



(6.147) 



Next, using 



we obtain 



A 22 A 2i - ~Jt and A 22 - 2 Xt \y 1:t ~ Jt^x t+1 \y 1:t Jt 



-î _ 



(6.148) 



^x t \y 1:T ~ JtV-x t+1 \y 1:T + i 1 ~ Jt A )Hx t \y 1:t 



(6.149) 



- Jt^x t+1 \y 1:T + Mx t |y 1:t Jt A Hx t \ yi .. t 
= Px t \y 1:t + /t(.Mx t+1 |y 1:7 - - A ^x t \y vx ) 

= Vxt&i-.t + Œx t \y 1:t AT (?x t+1 \y 1:t ) )^x t+1 \y 1:T ~~ A l*x t \y 1:t ) 



~ ^x t \y 1:t + (^x t \y 1 :t AT ( A ^x t \y 1: t AT + 2 %) )(^ t+1 |y 1:T ^Mx t |y 1:t ) 
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As for the Kalman filter équations, (6.122) and (6.123) may be written more succinctly as 

ftctlyiT = ^tWi-.t + h(Hx t+1 \y llT ~ A ^x t \ yi . x ) (6-150) 

and 

^tWv.T = Z ^|yi:t + A ( S x t+1 |y 1:T - ( A ^x t \y 1:t A T + ï. x j)jj (6.151) 
= ^x t \y 1:t + A(X t+1 |y 1:T - ^x t+1 \y 1:t )Jt 

using the définition 



A ■■= ^ tlyi ,A T (Al Xtlyi . t A T + 2 X ) _1 (6.152) 
The recursion is initialized using the Kalman filtering resuit for VL XT \y x . T an d ^x T \y x . T - 

6.3.1.4 Inferring p(x t _! , x t \y 1:T ) 

The expectation parameter components of y{x t _-y,x t \y VT ) correspond to the expectation parameters of 
P( x t-i \yi-.r) ar| d Vi x t\yi:T)> respectively, i.e., we have 

T 

fte t -i:t|yi:r = (ftet-ilyi:r'ftetlyi:r) (6.153) 

where the right-hand side components are evaluated based on the KRTS smoothing équations as discussed in section 
6.3.1.3. For the covariance matrix parameter of p(x t _ x ,x t \y 1:T ), we note that according to (6.131), we have 

S * t , t+1 |y 1: r = -^21^1 (6-154) 

and with 

Z *t + ilyi:r = T îi and A 22A 2 i = -A (6-155) 
We thus have the following équivalences for the covariance matrix of p(x t _ x , x t |y 1:T ): 

^Xï.t+i\y 1:T = ^x t+1 \y 1:T Jt 

° 2 *t-l:t|y 1:T = 2 *tlyi:T^t-l 

° ( 2 *t-i:t|y 1:T ) = ( S x t |y 1:T A T -l) 

° S *t-i:t|y 1: r = A-l^x t |y 1:T (6.156) 

where the last line corresponds to the desired recursion for E Xt _ lit| , which is initialized as 

2 «r-i:r|y l!7 . = A-l^ T |y 1:7 . (6.157) 

6.3.1.5 Summary and expectations 

The Kalman filter and KRTS smoothing recursions derived in Sections 6.3.1.1 - 6.3.1.4 are shown in 
pseudocode form in Table 4. Simplification of thèse équations for the univariate case x 1:T ,y 1:T £ M A := a £ R, 
B ■= b £ == ct x > 0 and I y := a y then yield équations (32) - (38) of the tutorial: 
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(6.159) 




(6.160) 




(6.161) 



'*t-l,tlyi:T : ]t-l^X t \y 



1:T 



<^> a. 



X t -l,t\yi:T ' Jt-l u X t \y 1:T 



(6.162) 



The variational update équations derived in Section 6.2 capitalize on the availability of the expectations of x t and x\ 
under the marginal variational distributions q^(x t ~) and the expectation of x t _ x x t under the variational joint 
marginal distributions q^ l \x t _ 1:t ) (t = 2, ...,T). As will be seen in Section 6.3.2, a necessary prerequisite for their 
évaluation is the évaluation of thèse expectations under the conditional LGSSM distribution p(x 1:T \y 1:T ) for known 
parameter 8. For the multivariate case, thèse evaluate in terms of the parameters of p(x t \y 1:T ) and p(.x t - lt x t \y 1:T ) 
for t = 1, ...,T to 



( x t)p(x t \y 1:T ) - ^x t \y- 



1:T 



(6.163) 




1:T 



(6.164) 



and 




'Xt-i:t\y 1]T 



(6.165) 



o Dérivation of euqations (6.163) - (6.164) 

The équations follow directly from the well-known identity 



<C(x,y) = E (xy T ) - E(x)E(y T ) => E (xx T ) = (C(x,x) + E(x)E(x T ) 



(see Supplément équation (1.21) and Bishop (2008) p. 20). 



□ 
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function Kalman Filter (Input, Output) 

Input y 1 . T ,[i 1 ,I 1 ,A,I. x ,B,I. y 

Output i"x t |y 1:t and Z Xt | yi:t for t = 1, ... , T 

% Initialization = Bayesian linear model inference for t = 1 

ftcdyi := + ï Xl B T (BÏ Xl B T + S y ) _1 ( yi - 
s *ilyi := ~~ 2 %i Br ( S2 xi Br + s y) BlL Xi 

% Recursion 

for t = 2, ...,T set 

Pt '■= ^c-ilyw-i^ 7, + 2 x 
AT t :=P t B T (BP t B T + Y.y)~ 1 

S * t |y 1:t == ( 7 - *t W - K t fi) T + ^y^ 
function KRTS Smoother (Input, Output) 

Input y 1:T , i4, S* and M Xt |y 1:t , 2x t |y 1:t for t = 1, ... , T 

Output M* t |y 1:r < ^x t |y 1:T for t = 1, ... , T and I Xf _ i:t , yi:T for t = 2, ... , T 

% Initialization = Kalman filter results for t = T 

^x T \y 1:T := l*x T \y 1:T 



% Recursion 

fort = T- 1,7- 2, ...,lset 



h 

V-x t \y x . T 

^X t \y 1:T 
^*t-l,tlyi:T 



,-1 



= ^tlynt^C^^tlyi:^ 7, + 2 *) 

= ^tlyi:t + A(l"x t+1 |y 1:T - A ^x t \y 1:t ) 
^x t \y 1]t + Jt(^x t+1 \y 1:T - ^x t+1 \y 1:t )J T 
Jt-l^x t \y 1:T 



Table 5. Pseudocode for the Kalman filter and KRTS smoother recursions. 
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6.3.1.6 Matlab implementation 



Below follows the Matlab implementation of the pseudocode shown in Table 5 as used in the tutorial. 

function [mu_f, Sigma_f] = lgssm_f ilter ( y , mu_x_l , Sigma_x_l , A, B, Sigma_x, Sigma_y ) 

% This function implements the forward pass for inference in a linear 

% Gaussian state space model, also known as the Kalman Filter. It 

% capitalizes on the expressions for the conditional expectation and 

% covariance as discussed in section 5 of the tutorial. Specif ically, for 

% the covariance update, Joseph' s symmetrized form is employed. 



Inputs 





y 


P 


x 


T 


data array 




mu x 1 


k 


x 


i 


initial latent variable expectation 




Sigma x 1 


k 


X 


k 


initial latent variable covariance 




A 


k 


X 


k 


transition matrix 




B 


P 


X 


k 


émission matrix 




Sigma x 


k 


X 


k 


latent variable noise covariance 




Sigma y 


P 


X 


P 


latent variable noise covariance 




Outputs 












mu f 


k 


X 


T 


filtered mean of p(x t|y (l:t)) 




Sigma f 


k 


X 


k 


x T filtered covariance of p(x t|y 



% Copyright (C) Dirk Ostwald 



% recover System and data dimensionalities 
k = size (A, 1) ; 

p = size (y, 1) ; 

T = size (y, 2) ; 

% initialize conditional expectation and covariance arrays 
mu_f = NaN (k, T) ; 

Sigma_f = NaN(k,k,T); 

% Initialization 



mu_f(:,l) = muxl + Sigma_x_l*B ' *inv (B*Sigma_x_l*B ' + Sigma_y ) * ( y ( : , 1 ) - B*mu_x_l); 

Sigma_f ( : , : , 1) = Sigma_x_l - Sigma_x_l*B ' *inv (B*Sigma_x_l*B ' + Sigma_y) *B*Sigma_x_l ; 

% filtered covariance symmetry check 

if -isequal (Sigma_f ( : , : , 1) , Sigma_f ( : , : , 1) ' ) 

Sigma_f ( : , : , 1) = 1/2* (Sigma_f ( : , : , 1) + Sigma_f ( : , : , 1 ) ' ) ; 

end 

% Recursion 



% cycle over remaining time-points 
for t = 2:T 

% predictor and Kalman gain matrices 

P = A*Sigma_f ( : , : , t-1) *A' + Sigma_x; 

K = P*B ' *inv (B*P*B ' + Sigma_y) ; 

% filtered expectation and covariance updates, Joseph 's form 
mu_f(:,t) = A*mu_f ( : , t-1) + K*(y(:,t) - B*A*mu_f ( : , t-1) ) ; 

Sigma_f ( : , : , t) =(eye(k) - K*B) *P* (eye (k) - K*B) ' + K*Sigma_y*K ' ; 

% filtered covariance symmetry check 

if -isequal (Sigma_f ( : , : , t) , Sigma_f ( : , : , t) ' ) 

Sigma_f ( : , : , t) = 1/2* (Sigma_f ( : , : , t) + Sigma_f ( : , : , t) ' ) ; 

end 

end 
end 
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function [mu_s, Sigma_s, Sigma_s_c] 



lgssm_smooth ( y , mu_f, A, Sigma_f, Sigma_ 



This function implements the backward pass for inference in a linear 
Gaussian state space model, also known as the Kalman-Rauch-Tung-Striebel 
smoother, and here in addition returning the cross-moment covariance of 
p(x_t,X_(t+l) |y_(l:T) ) 



Inputs 





y 


P 


X 


T 


data array 




mu f 


k 


X 


T 


filtered latent variable expectation 




A 


k 


X 


k 


transition matrix 




Sigma f 


k 


X 


k 


filtered latent variable covariance 




Sigma x 


k 


X 


k 


latent variable noise covariance 



Outputs 



mus 

Sigma_s 

Sigma_s_c 



k x 1 smoothed mean of p (x_t I y_ ( 1 : T) ) 

k x k x T filtered covariance of p (x_t I y_ ( 1 : T) ) 
k x k x T-l filtered cross moment covariance of 
p(x t,x (t+1) |y (1:T) ) for t = 1,...,T 



Copyright (C) Dirk Ostwald 



% recover System and data dimensionalities 
k = size (A, 1) 

p = size (y, 1) 

T = size (y, 2) 



% initialize conditional expectation, 
% covariance arrays 
mu_s = NaN (k, T) ; 

Sigma_s = NaN(k,k,T); 

Sigma se = NaN (k, k, T-l) ; 



covariance and cross moment 



% filter initialization 

mu_s ( : , T ) = mu_f ( : , T ) ; 

Sigma_s ( : , : , T) = Sigma_f ( : , : , T) ; 

% filter recursion 

for t = T-l : -1 : 1 



% smoothing matrix J 

J = Sigma_f ( : , : , t) *A' *inv (A*Sigma_f ( : , : , t) *A' + Sigma_x) ; 

% smoothed expectation and covariance updates 

mu_s(:,t) = mu_f(:,t) + J* (mu_s ( : , t+1 ) - A*mu_f ( : , t) ) ; 

Sigma_s ( : , : ,t) = Sigma_f ( : , : , t) + J* ( Sigma_s ( : , : , t+1 ) - Sigma_f ( : , : , t+1 ) ) 

% smoothed covariance symmetry check 

if -isequal (Sigma_s ( : , : , t) , Sigma_s ( : , : , t) ' ) 

Sigma_s ( : , : , t) = 1/2* (Sigma_s ( : , : , t) + Sigma_s ( : , : , t) ' ) ; 

end 



% cross moment 

Sigma_s_c ( : , : , t) = J*Sigma_s ( : , : , t+1 ) ; 

end 
end 
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6.3.2 Unified inference 

In this section, we consider the variational Bayes update équation 

<7 a+1) Oi:r) K exp(<lnp(y 1:r< z 1:r< 0)) ?( i + i) (0) ) (6.166) 

for the gênerai case that 9 is an unobserved random variable and thus the expectation over q( l+1 \6) and the 
marginal distribution p(0) in the argument of the exponential above cannot be neglected. We first note that from 
(6.71), (6.166) is équivalent to 

q (i+1) (x 1:T ) « exp (Onp(xi : r,yi:r|0)>q(i+i)(é>)) (6.167) 
which from (6.73), in turn, is équivalent to 

q (i+1) (.x 1:T ) a exp(<lnp(x 1:r |y 1:T ,é»)) £?(i+ i )(e) ) (6.168) 
Further, the argument of the logarithm above can be re-parameterized according to (6.74) as 

^ +1 )(x 1:T ) a exp ((ln (p(x x \y 1:T ) Uï =2 ^^ Ww) (6 " 169) 

which corresponds to the parameterization exploited by the KRTS inference machinery. While the proportionality 
statements of (6.166) - (6.169) are équivalent, following (D. Barber & Chiappa, 2007), we focus on the formulation of 
the variational update rule in the form of (6.167) below, sometimes referred to as the "expected energy 
formulation". For the purposes of inferring the parameters of the right-hand side of (6.167) using the KRTS inference 
framework, one may think that, for the current purposes of random 8 the parameter expectations under the 
variational distributions, or in other words, (d) q a+D^gy may be forwarded to the KRTS smoothing algorithm. 
However, this is not appropriate, as one can show, that for the LGSSM the expectation of the energy function under 
the current variational distribution over parameters q^ l+1 \6) does not equal the energy function with averaged 
parameters. In other words, in gênerai, 

(ln p(y 1:T ,x 1:T \6)) q a + i) i9) * lnp(y 1:T ,x 1:T |(6i> (? a + i) (0) ) (6.170) 

Barber and Chiappa (D. Barber & Chiappa, 2007) have shown how, nevertheless, standard inference algorithms may 
be applied for the VB update of q( l+lj '(x 1:T ) . In the following, we further discuss the inequality (6.170) as the "mean 
and fluctuation décomposition theorem" in Section 6.3.2.1 and the solution provided by (D. Barber & Chiappa, 
2007) in form of the "unified inference theorem" in Section 6.3.2.2. As in Section 6.3.1, we use the gênerai 
multivariate LGSSM form (cf. équation 6.77) throughout and only dérive the spécial univariate LGSSM case for the 
tutorial example at the end of this section. In contrast to Section 6.3.1, however, we here use a précision matrix 
rather than covariance matrix parameterization of the Gaussian distributions involved, because we posited précision 
parameter marginal (prior) distributions in (6.5) and (6.7) and now are required to take thèse explicitly into account. 
With 9 •■= (p x , A 1( A, A x , B, A y }we thus reexpress équation (6.77) for the current section as 

Pe(^l:T<yi:r) = P^AiOl) 1^=2 PaA* ( X t F^l P B ,A y Ot l*t) (6.171) 

where 

p(x x ) = NOi^AT^XpCxtlx^i) = N^Ax^A' 1 ) andp(y t |x t ) = N{y t ;Bx t ,Ay 1 ) (6.172) 
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6.3.2.1 The mean and fluctuation décomposition theorem 



The inequality (6.170) is summarized in (D. Barber & Chiappa, 2007) as the "mean and fluctuation 
décomposition theorem." Specifically, it can be shown that the expectation of the conditional energy function under 
q( l+1 \8) can be written as the sum of two components: (1) the joint distribution of x 1:T and y 1:T conditioned on 
the parameter expectations (d) q a+i)^ ar| d (2) "fluctuation" terms. For ease of notation, we omit the superscript 
(i + 1) for the variational distributions over the éléments of 6 in the formulations below. We have the following 
theorem: 

Mean and fluctuation décomposition theorem 

The following décomposition holds 

<lnp(x 1:T ,y 1:T |6O>q ( 0) = lnp(x 1:T ,y 1:T \(9) qW ) + F AAx + F BiAy (6.171) 
where, with a constant C £ R 

lnp(x 1:T ,y 1:T |(6»> £?(0) ) = - n-rf A x {x x -jU x ) 

T 

- fat - (B)q(B\A y )X t ) (Ay>q(A y ) (vt ~ <B) q ( B[Ay )X t ) 

+ c 

(6.172) 

and 

Fa,a x •■= -\YJt=iX T t{{A T h x A) q{AAx) - (A% (AlA JA x ) q(Ax) (A) q(AlAx) )x t (6.174) 

F B ,A y ■■= -\TJt=iX T t ((B T A y B) q{BAy) - (B\ {BlAy) (A y ) q(Ay) (B) q{BlAy) )x t (6.175) 

□ 

Note that the décomposition implies the inequality of (6.170) for non-zero terms of F A Ax and F B Ay . Given that the 
"fluctuation" terms F AA and F BA do not vanish except under the assumption of infinité précision of the 
variational distribution (i.e. the scénario discussed in Section 6.3.1), standard LGSSM inference algorithms supplied 
with the expected LGSSM parameters (d) q a+i)^ are thus not appropriate for inference of the variational update 

o Proof of the mean and fluctuation décomposition theorem 

We start by reformulating the left-hand side of (6.171) based on the conditional independence properties of 
the LGSSM. 

{\np(y 1:T ,x 1:T \9)) qW = {lnp(x 1:T ,y 1:T \A,A x ,B,Ay)) q ( AAxBAy ) 
= (In p(xi:t, yi:r\A, A x , B, A y )) q{AiAx)q(BiAy) 

= (ln(p(Xi) n[=2 P(* t l*t-1* A, A x ) Uï=l Viyt \x t , B, Ay))) q (A,A x )q(B,A y ) 
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(6.176) 

where the factorization of q(A, A x , B, A y ) follows from the properties of the LGSSM (Beal, 2003). Substitution of the 
respective probability density functions in (6.176) then yields 

(lnp(y 1:T , x 1:T \9)) qW = (ln(jV(x i; fi^AJ 1 ) Uï= 2 N(x t ; Ax t _ lt A' 1 ) n[=i N(y t ; Bx t , A- 1 )))^^^) 

= (In ((2tt) _ 2|AI 1 | 2 exp (~Ol - /i 1 ) T A 1 (x t - Ml))))q(AA x )q(B,A y ) 

+ (In (rÏÏ=2(27r) _2 |A;T2 exp (~^(x t - Ax^YA^ - Ax t _Sj))q(AA x MBA y ) 

+ (In (rit=i(27r)-2 lA- 1 !" 2 exp [-\{y t ~ Bx t ) T A y {y t - BxSfj) q{A , Ax)q{B ,A y) 

(6.177) 

Evaluation of the expectations with respect to the variational distributions whose arguments do not feature in the 
respective terms then yields 

(lnp(y 1:T ,x 1:T |6O)q (0) = ln^OO^lAï 1 ! 2 exp (- X -{x x - n{? A x {x t - ^Sfj 

+ (ln ^n[=2(27r) _2 |2 x r 2 exp^-i(x t - Ax t _^ T A x (x t - Ax^S^q^A^ 
+ (ln(n[=i(27r)-2|A- 1 |" 2 exp (-\{y t - Bx t ) T A y {y t - Bx t )))) q{BiAy) 

(6.178) 

Evaluating the remaining expectations and the logarithmic terms then yields 

(lnp(y 1:T ,x 1:T |6i)>q (0) = -^lnZTT-ilnlA^ 1 ! -\{x x - J u 1 ) T A 1 (x 1 

+ { _ k(çi) ln 2n _Ii ln | A -i | _ ±JX =2 ( Xt _ Ax t _{) T A x {x t - Ax t _i)) q(AA j 

+ (-^\n2n-l\n\A-^- 1 -Y l U(y t -Bx t rA y (y t -Bx t )) q(BAy) 
= -^lnZjr-ilnlA^ 1 ! -\{x x - \i{f A x {x x - 

-ln 2tt- (^ilnlA- 1 !)^^^) -^Zt= 2 ((^t ~ Ax t _ x Y A x (x t - Ax^)) q{AAx) 



2 " 2 
fe(T-l). 



-ÇlnZTT- ?-\n\A^\) q{BAy) - \ïU«yt - Bx t YA y (y t - Bx t )) q{BAy) 

(6.179) 

Summarizing ail terms independent of x lt ... , x T as a constant C £ R yields 

(lnp(y 1:T ,x 1:T |0)>q (0) = ~\{x x - n-CY A x {x x - [i-J 

~ jZt=2((*t - Ax^YA^Xt - M-i^qUA*) 
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-^Et=i<(yt - Bx t ) T A y (y t - Bx t )) q{ B Ay) 

+ c 



(6.180) 

The right-hand side of (6.180) may be decomposed into the contribution of the exponent of an LGSSM with averaged 
parameters and additional "fluctuation" terms. Exemplarily, we dérive this for a single term in the sum over 
t = 2, T on the right-hand side of (6.180) and generalize the resuit to the remaining terms below. Specifically, we 
have 

(O t - Ax^f A x (Xt - 4Xt-l)>q( AA;c ) 

= {xj A x x t — x T h. x Ax t _i — x[_tÂ} A x x t + xJ_ 1 A l A x Ax t _ t ) q(A,A x ) 

= (XtKx t )q{A,K x ) ~ ( x t rA ^^i-i>îU,AJ - ( x t-iA T A x x t ) q(AAx) + (xJ- 1 A T A x Ax t _ 1 ) q ( AiA j 
= xJ{A x ) q{AAx) x t - x t T (A x ^) £?(AA;c) x t _ 1 - x t T _ 1 (^ T A z ) £?UA;c) x t + x t T _ 1 (^ T A x ^) £?UA;c) x t _ 1 

= *t ( A x)q(A x ) x t ~ x t ( A x A )q(A\A x )q(A x ) x t-l ~ x t-l( AT ^x)q(A\A x )q(A x ) x t + x t-l( AT ^x A )q(A,A x ) X t-l 

(6.181) 

On the right-hand side of (6.181) the first three terms involve expectations over first-order terms (or, more formally, 
expectations over random variables under the identity function), which may be evaluated to 

<(x t - Ax^f A x {x t - M-JVaa*) 

= xj {A x ) q ( Ax) X t - X T (A x ) q ^ Ax ^(A) q ^ A \ A ^X t _ 1 - ^r-i(^)q(^|A ;c )(A x >q( A;c )X t + xJ^A 1 A X A) q(A,A x ) X t-l 

(6.182) 

The last term on the right-hand side of (6.181) and (6.182) involves an expectation over a second-order term in A. 
For gênerai functions g (such as the square) of a random variable, one has E(g(z)) ^ c/(E((z)). Thus, to avoid the 
explicit évaluation of the expectation (A T A x A) q ^ A Ax) , we rewrite the last term of (6.182) as follows 

xJ- 1 {A T A x A) q(AAx) x t _ 1 

= x t-l(( AT ^x A )q(A,I. x ) + ( AT )q{A\A x )(h x )q(ï. x )( AT )q(A\A x ) ~ ( AT )q(A\A x )(^x)q(A x )( AT )q(A\A x )) x t-l 

(6.183) 

i.e., by adding and subtracting the term xJ_ 1 {A T ) q ^ A \ Ax ^{A x ) q ^ Ax ^{A T ) q ^ Ax) x t _ 1 . From (6.181) and (6.182), we then 
have 

((X t - Ax^YA^Xt - M-l^qCAA*) = X t(^x)q(A x ) x t 

~ xT {A-x)q(A x )( A )q(A\A x ) x t-l ~ x t-l( A )q(A\A x )(^x)q(A x ) x t 

+xï_ 1 ((A T ) q(AlAx) (A x ) q(Ax) (A) q(AlAx) )x t _ 1 
+xï- 1 ((A T A x A) q{AAx) - (A T ) q{AlAx) (A x ) q{Ax) (A) q{AlAx) )x t _ 1 

(6.184) 
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which we may rewrite as 

<O t - AXt^Y A x {x t - AKf-i)),^) = (x t - ( A )q(A\A x ) x t-l) T (^x)q(A x )( x t - (^>q(^| A;c )X t _ 1 ) 

+x t T _! ((^A^)^^) - {A T ) qiMAx) {A x ) qiAx) {A) q{MAxx) )x t _ 1 



(6.185) 



The first term in (6.185) is a quadratic term with "parameter expéditions" and the second term forms the basis for 
the fluctuation terms. Generalizing the logic of (6.181) - (6.185) to ail terms on the right-hand side of (6.180) then 
yields 

(ln(y 1:T ,x 1:T |6»)>q (0) = -^lnZTT-ilnlAï 1 ! - 1 -{x 1 - n^A^ -^i) 

1 T 
~2^t=l{ x t ~ ( A )q(A\A x ) x t-l) (A- x )q(A x )( x t ~ ( A )q(A\A x ) x t-l) 

-iIt=2^-l(M^>qU,A,)-M T >qU|A,)<Ax)q(A,)M>qU|A,)K-l 
" ilLl fat ~ (B) q ( B \ Ay ) x t) (A y )q(A y ) (y t ~ (B) q ( B \ Ay ) x t) 

~\l7t=i x t ((B T A y B) q(BAy) - (B T ) q(BlAy) (A y ) q{Ay) (B) q{BlAy yjx t 



+ c 



Re-ordering of the terms in (6.186) then yields 

(ln(y 1:T ,x 1:T |6i)>q (0) = -^lnZjr-ilnlA^ 1 ! -\{x 1 - \i^f A x {x x - fi-J 

-\TJt=l{ x t ~ ( A )q(A\A x ) x t-l) (A x )q(A x )( x t ~ ( A )q(A\A x ) x t-l) 
- ilLl (yt - (B) q (B\A y ) X t) (Ay>q(A y ) (yt - (B) q ( B \ Ay ) x t) 

+ c 



(6.186) 



-lU=2 x t-l(( ATA x A )q(A,A x ) ~ ( AT )q{A\A x )(h x )q(A x )( A )q(A\A x )) x t-l 

-lU=i x t ((B T A y B) q{BAy) - (B T ) q{B]Ay) (A y ) q{Ay) (B) q{B]Ay) )x t 



(6.187) 



From (6.180), the first four terms in (6.187) may be summarized as lnp(x 1:T ,y 1:T |(0)q( 0 )). Changing the subscript 
index range for the first term fluctuation term and defining 



Fa,A x '— - lTJt=l x I{{ AT ^x A )q{A,A x ) ~ ( AT )q(A\A x )(h x )q(A x )( A )q(A\A x )) x t 



(6.189) 



and 



F B ,A y ■■= -\LU x l ((B T A y B) q&{BAy) - (B 7 ') £? (0 (B|A> , ) <A y ) £? (o (Ay) <B) q (0 (B|Ay) )x t (6.190) 
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and reintroducing the superscript notation q^ l+ \6) for the variational distribution then complètes the proof. 



6.3.2.2 The unified inference theorem 

Based on the mean and fluctuation décomposition theorem, the unified inference theorem allows for inferring 
q(x 1:T ) in the case of random 6 governed by a variational distribution q(8). For ease of notation, we omit the 
superscripts (i + 1) forq(x 1:T ) and (i) for q(0~) in the current section as above. Note that we formulate the 
augmented LGSSM in terms of covariance, rather than précision, matrices for consistency with Section 6.3.1. 

The unified inference theorem 

To infer the distribution <7(x 1:T ) using classical inference algorithms define an LGSSM 

Pè(yi:r,*i:r) = P Ml ,z 1 (^i)n[=2PÂ2 x (^l^-i)n[=iPB,z y (ytkt) (6-191) 
with non-random parameter 9 ■= {Â, î. x , B, £ y } for which the equality 

Pg(yi:r<*i:r) = (lnp(yi:r,*i:r,éO>q(0) (6.192) 



holds. To this end, set 



y t := (o k \ ( £ = 1 T),Â := (A) q(A]Ax) ,ï x := ((A x ) q (A x )) * (6.193) 



and 



f( B ) q (B\A y )\ f( B ) q (B\A y )\ 

B t '-=[ U A \(t=l T-l), B T :=Ï 0 I (6.194) 

with L^and U B defined by the Cholesky décompositions of 

UlU A := (A T A x A) qiAAx) - (A T ) q(AlAx) (A x ) q(Ax) (A) q{AlAx) (6.195) 

U T B U B := (B T A y B) q{BAy) - (B T ) q{BlAy) (A y ) qiAy) (B) q{BlAy) (6.196) 

and 



h : = I 0 



pxk 


Opxk 1 








| (6.197) 




T k k j 





kxp 

Inferring pg(x 1:T |y 1:T ) from pgCyi^, x 1:T ) by means of KRTS smoothing then allows for the variational update 

q(x 1:T ) a exp((ln(p(x 1:T |y 1:T , 60)^0)) = exp(lnpg(y 1:T ,x 1:T )) = pg(x 1:T |y 1:T ) (6.200) 

where the proportionality statement is fulfilled with proportionality constant 1, because pg(x 1:T |y 1:T ) corresponds 
to a probability density function, which intégrâtes to unity. 

o Proof of the unified inference theorem 

We verify that 
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lnpv(y 1:T ,x 1:T ) = (\np(y 1:T ,x 1:T \6')) q (g ) ' (6.201) 

by substitution. We have 

In Pg (Vi-.t, x 1:T ) = ln (p Ml , Zl (x x ) Uï=2 Và% x (*t l*t-i) UÏ=i PbX v (?t l*t)) 

= ln(jV(x i; Mi, Ii) U T t=2 N(x t ; Âx t _ x , l x ) f[Li N(y t ; Bx t , 2 y )) 
= -^lnZTT-ilnlIil -i(x x - j u 1 ) T Iï 1 (xi -Mi) 

fe(T-l) 



2 

(p + 2fe)T 



ln 2tt - |ln|S y | - ((y t - Sx t ) T ï~ 1 (y t - Bx t )) 

(6.202) 



Summarizing ail terms independent of x 1:T in a constant C 6 E, we obtain 

lnpg(yi:r<*i:r) = -^(x x - j u 1 ) T Sï 1 (^i - Mi) 

-\Yl=i{y t -Bx t ) T t-\y t -Bx t ) 



2 ■ 
+ C 



(6.203) 



Substitution of the augmented observed variable and parameters yields 

lnpg(yi:r<*i:r) 

= -\ix 1 - Mi) T ir 1 (. x i - Mi) 

~\YJt=2 (( x t ~ (A) q (A\A x ) x t-l) T (à x ) q (A x )( x t ~ (A) q (A\A x ) x t-l)) 

X T , i v -1 . 

'Vt\ f(B) q (B\A y )\ \ /((A y >q(A y )J 0 pxfe 0 pxfe \ / /Vt\ /<B)q(B|A y ) 



-HfeH £ r 



i ((° T ) f (s><?(BiAy) ] y (^ {Ay)qiAy ^~ i ° pxk ° pxk \ ((o t ^ ( {B)q(BiAy) 

\ k \ Ub / / \ Q kxp 0 kxk I k J \ k \ Ub 

+ C 

(6.204) 

Resolving the brackets of the third and fourth sum terms of the right-hand side of (6.204) then yields 

lnpg(yi:T,*i:r) = - \( x i ~ Mi^rVi - Mi) 
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1 T 
-^t=2{x t ~ (A^WAjXt.J (A x ) q{Axx) (x t - (^> £?U | A;c )X t _ 1 ) 




,-1 



-1 



u fcxp 
Ofcxp 



,-1 



+ ^=i(° fe o kxp i k o kxk u A 

\ O kxp n. / V U* 



u pxk 


u pxfc 


T 


u fexfe 


/cx/c 


h 

K , 


'pxfc 


°pxfe \ 


/fc 


Ofcxfc 1 


'fcxfc 


/fc / 




-1 



( B >q(B|A y )\ /((Ay>q(A y )J O pxfe O pxfe \ (<B) q ( B[Ay ) 



-^Ï=l4[ U A II 0kxp ' / k O kxk I l ^ 



Ofexp Ofexfe 4: 



/yr\ T /(<Ay)q(A y )) 1 O pxfe O pxk \ f{B) q (B\Ay) 

+ I O kxp / fc O kxk I I O )*r 



.0 



k V Ofexp Ofcxfc 4 / ^ ^ B 




'pxk 


Opxfe 1 




h 


Ofexfe 




'kxk 


4 y 


V u B 



O fe xp 4 O frxfr I l « pr 

Ofexp 



+ c 



Writing out the matrix products on the right-hand side of (6.205) then yields 

lnpg(y 1:r ,x 1:T ) = ~\i x i -V-iY^i*! - Mi) 

1 T 
"ZUfct - (^>qU|A,)^-l) (K)q{K x A X t ~ (A) q (A\A x ) x t-l) 

-\YJt=iyt(^y)q{A y )yt 

+ 2r=i 1 yt(A y >q (Ay) (s> q(B|Ay) x t 

- ^2t=l X t(B) T q ( B]A ^(Ay) q (A y )W) q ( BlAy )X t 

~ 2 St=l X ? ^A^A x t 
~ 2 Ht=l x t^B^B x t 

- 2yT(Ay>q(Z y )yT 
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+yr<A y > £?(Ay )<B> £?(B|Ay) x T 



+ C 



(6.206) 

Summarizing terms on the right-hand side of (6.206) and writing out the matrix products UjU A and U T B U B then gives 
lnpg(yi :r ,x 1:7 0 = -jCki - i ui) T 2ï 1 (x 1 -^ x ) 

_ 2^t=2( x t _ (^>qU|A x ) x t-l) ( A x>q(A x )( x t _ ( A )q(A\A x ) x t-l) 
"ilt=l (y t - <fi>q( B |A y )*t) T ( A y>q(A y ) (?t - < B >q( B |A y )*t) 

~\lL T t=lxJ{{A T k x A) q{AiAx) - (A T ) q{AlAx) (A x ) q{Ax) (A) q{AlAx) )x t 

-\ïUxï {(B T A y B) q(BAy) - (B T ) q(BlAy) (A y ) q{Ay) (B) q{BlAy) )x t 
+C 

(6.207) 

The first three terms on the right-hand side of (6.207) together with the appropriate terms in the constant C may 
now be recognized as an LGSSM with "averaged parameters" and we have 

^Pè(yi:T>x 1:T ) = \np(x 1:T ,y 1:T \(9) qW ) 

-\TJt=ixJ{{A T h x A) q{AAx) - {A T ) q{MAx) {A x ) q{Ax) {A) q{A \ Ax) )x t 
-\Y,UxJ{{B^B) q{BXy) - {B T ) q{B ^- y \ { ^ y) {B) q{my) )x t 



For A 1 = I. 1 1 =The right-hand side of (6.208) is identical to (6.171) and we thus have shown that 



(6.208) 



(6.209) 



lnPê(yi:T<*l:'r) = (lnp(yi:T^l:T|6')>q(0) 

if 6 and y 1:T are defined as in (6.193) - (6.196). This concludes the proof. 
6.3.2.3 Unified inference for the tutorial example 

We now consider the variational distribution update équation (6.21) for the tutorial example 

q ( ' +1) Oi:r) a exp ((lnp(yi :T ,x 1:T ,a,A x ,6,A y )) q (i +1)(a)q a +1 ) (Ax)q a +1 ) (ft)q a +1 ) (/ly) ) (6.210) 
where we set according to the unified inference theorem 

<7 a+1) (*i:r) a e w((\n(p(x 1]T \y 1]T ,a,A x ,b,A y )) q ( aÀxbÀy ^ = exp(lnpg (0 (y 1:T ,x 1:T )) = pg&(.x 1:T \y 1:T ) 

(6.211) 
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Applying the observed variable and parameter augmentations introduced in the unified inference theorem to the 
univariate LGSSM under considération in the tutorial, we obtain 



and 



% :=^fort = l T 



;(0 ._ 



J(a 2 ^>qa+i)(a)qa+D(A x ) ~ < a >q(tHi)(a)< A *V' +1 >(Az) 
\ v J(^ 2 ^y>qa+i)( ft )qa+i)(A y ) ~ < 6 >J(i+i)(6)< A y>qf('+i)(Ay)/ 



/ 



\ 



+ a a 2(i+1) ) a£ +1) fc< i+1) - ^ £+1)2 a? +1) ^ +1) 



/ 



\ 



(i + 1) 



L 2 a+i) a a+i) 6 a+i) 



fort = 1, ...,r- 1 



Ày Ày J 



( 



\ 



_2('+l)_(t+l)|,(t+l) 



0 
0 



1 0 
0 V 



0 
0 



1 0 

0 1> 



The KRTS inference machinery then allows for inferring the expectation and covariance parameters of 

<7 w Oi:tO = PêCoC^iirbiir) = Pgm (*i lyi : r) F[t=2 ^^T^ô 



(6.214) 

(6.215) 
(6.216) 



(6.218) 



(6.219) 



(6.220) 



(6.221) 



The distributions on the right-hand side of (6.221) are given in parameterized form by 
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p0&(.x t \y 1:T ) = N (x t ;P.^ iT ,ô^ tl y iT il) ) (6.222) 

for t = 1, ...,T, and 

PM x t-i^t\yi:T) = n (^;/él 1:t |, 1:r ^xl l!t iyi:r) (6 - 223) 

for t = 2, respectively. Note that we use the tilde notation throughout, to distinguish the parameters of thèse 
distributions from the parameters inferred from a non-augmented LGSSM by means of KRTS smoothing. 
Consequently the remaining expectations in the variational update équations for a,A x , b and A y shown in Table 5 
evaluate in parameterized form to 

<*t V+Dfe) = <*t )p m (x t \y 1:T ) = àï t \yj l) + [&]\y ± . T ) ( 6 - 224 ) 

for t = 1, ...,T, and 

( x t-i x t>qa+D( Xt _ 1:t ) = ( x t x t-i)pg (0 Ut-i:tlyi:T) = ° 2 xt- 1:t \yi:T + lyi^tWr (6.222) 

for t = 2, ... ,T, where ô ;2 xt_ 1 . t |y 1T dénotes the off-diagonal élément in the first row of Ê®_ it|j>ir . Table 6 
summarizes the development of the variational update and unified inference framework thus far. 
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% initialization of the variational distribution parameters for a, o£, b, <r£,x 1:T 



(0) ft (0) 



0 0^ 

0 10 
0 0 1^ 



(^1 :£ . ^fe^L r : = Kalïiiciïi Filter (y 1:T , ^ al a? (0) . 5< 0 >, â y 2(0) ) 

(«k^l^^L r -(^ 1:£ „ 1:T ) t=2 r ) : =OrS5mo 0t / le r( yi:r ,^â to) .^, fct CO, .^ CO) ) 

% itérative update équations for the variational distribution parameters 
fori = 0,1,2 ... 

_2Ci+« /fi), Ci)yT f 2 (i) ./-(i) A, if 



„U+1) ._ 2«+D 0)yr (=2(0 ,„«) ,,(0 "\ + M 



,o+« ._ I 



2 + Q A* 



(,0+D 



2 0+1) . 



,0+D ._ I 



2 +Q % 



(,0+D 



■y t 



/4 i+1) 



Ù0) ._ 



,2 0) . 



^ t |y 1:t Ci+1) ) t=1 T : = faJmanF«t e r(y 1:r ,, il , £ r 1 2 ,â«,^ Ci) ,6 t Ci) ,â y 2Ci) ) 



(i+D 



'«X +1) f 0 oA 

0 10 
V 0 0 1/ 



Table 6. Pseudocode for parameterized variational parameter update équations for the unobserved variables 

a, b, A x , Xy, x-i-j 
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6.4 Evaluation of the variational free energy 3 7 (qt(i?)) 

In Supplément Section 5.5, we rewrote the variational free energy as the sum of an "average likelihood" and the KL 
divergence between the variational approximation to the posterior distribution over unobserved variables and the 
marginal (prior) over unobserved variables as follows: 

T(qHW) = £av(p(y\nq i HW)-K£(qtHmpW) = I q {i \d)\np(y\d) dd - j q^(d)\n{^^j dd 

(6.223) 

Using the variational free energy décomposition of (6.223) here would necessitate (1) the explicit dérivation of the 
"average likelihood term" for the current model under considération, and (2) the explicit dérivation of the marginal 
distribution p(x 1:T ) of in the joint distribution specified by (6.1) based on its conditional independence properties 
and the functional forms of its subcomponents specified by (6.2) - (6.10). By using an alternative décomposition of 
the variational free energy, we can circumvent thèse dérivations, and capitalize on variational expectation of 
v(yi:T> x i:T> a >À-x>b,A.y) derived in Section 6.2 and the known differential entropies of the distributions comprising 
the variational approximation. Specifically, we here use the following décomposition of the variational free energy: 

T(f>m) = £ m , (p(y,nq (0 (i9)) + ^(q (0 (i9)) = ( q^(d)\n(p(_y,-d)) dd - ( q^(d)\n(q^(d)) dd 

(6.224) 

where we refer to the terms £ av (p(y,d), q^ l \d)j and J£ (q^(d)j as "average energy" and entropy, respectively. 

For clarity, we dérive both décompositions below, omitting the algorithm itération superscript (i) of the variational 
distribution for ease of notation. 



o Dérivation of (6.223) and (6.224) 

By définition (cf. équation (13) of the tutorial), we have 

HqW)) ■■= ! qtf) In (^) dd (6.225) 
Rewriting the joint distribution over observed and unobserved variables as 

p(y,-d) = p(y\d)p(d) (6.225) 
results in the "average likelihood - KL divergence" décomposition of the variational free energy as follows: 

HqW) ■■= ï qW In (^) dd (6.226) 

= /#)ln(«)^ 

= fq(d) (lnp(y|0) + ln(^))d0 

= J q(d) ln p(y \d) dd + f qtf) ln (^) dd 

= f q(d) ln p(y |0) dd - J q(i9) ln gjg) dd 

■•= £ m (p(y\nq&))-X£(q(d)\\P&)) 

(6.226) 

Note that the KL-divergence is defined by the fact that the probability distribution integrated over corresponds to 
the numerator of the ratio in the argument of the logarithm (Cover and Thomas (2006), p. 251). Merely decomposing 
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the argument of the logarithm in the définition of the variational free energy results in its "average energy - entropy" 
décomposition as follows 

= fqtf) (lnp(y,t9) + ln(^))dt9 

= / q(i9) ln p(y, 0) dfl + J" q{d) ln (^) dd 
= Jq(i9)lnp(y,i9)di9 - / q(i9) ln q(i9) di9 
=:£ av (p(y,i9),q (0 (i9)) + ^(q (0 (i9)) 

(6.227) 

Note that the differential entropy is defined as 

K(q(-Ô)) :=-J q(i9) ln q(i9) dd (6.228) 
(Cover and Thomas (2006), p. 243), thus the change in sign in the last définition of (6.227). 

n 

Based on the "average energy - entropy" décomposition, the variational free energy for the current generative 
model and variational approximation can be evaluated to 

T (V 0 09)) = £ av (p(y,0), <Z (0 09)) + H (V°09)) (6.229) 

where 

£ av (p(y,n<7 a) W) 

- l„(r(a^) ) - a Az Info) + (a Aje - l) (tf (og) + ln (*£)) - ^a«*« 

- ^ ((^ 2( ° + W°) 2 ) - 2 ^> + ^) - I ln 2 ^ 

- in ( r (%)) - % in M + (% - 0 (* + m (*£)) - ^ « 
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(6.230) 



and 



w(q®(*)) = i]n(2ir C ff*®) 

+ a£ + ln(^) + lnr(a£) + (l-^)^(a£) 

+a®+ln(b®) + lnr(a®) + (l-a«)^a®) 



+ (T - 1) ln(27re) + Z[= 2 In |2* t _ 1:t |y 1:r I " ^^=2 In ô*,^. 



(6.231) 



o Dérivation of (6.229) - (6.331) 

Based on (6.224) 



J 1 (q © (i9)) = / q « (i9) ln(p (y, i9)) di9 - / q « (i9) ln (q « (t9)) di9 



(6.230) 



we first recall the définitions of y, i9, p(y, i9) and q©(i9) for the generative model and variational approximation 
under considération. As noted in (6.14), we have 



and thus 



From (6.12), we further have 



The "average energy term" in (6.230) 



y := y 1:T and i9 := {(x 1:Tl a, A x , b, A y )} 



p(y,i?) == p(y 1 .. T ,x 1 , T ,a,A x ,b,A y ) 



= q»(a)q»a x )q»(6)q»(A y )q«(x 1:T ) 



(6.231) 



(6.232) 



(6.233) 



, x VT , a, A x , b, Ay) dadA x dbdA y dx 1 . T 
= (ln p(y 1:T , x 1:T , a, A x , b, ^y)) q m Wq &^ x)q iO^ )q m^ )q m iXl:T) 

(6.234) 

has been evaluated in Section 6.2 for i = j = k = l = m = n as 

(ln p(y 1:T , X 1]T , a, A x , b, ^y)>q(0 (a) q(0 (A;c) q(0 (ft) q(0 (Ay) q(0 (Zl:T) 

+ ^<lnA x > (I co ( ^ ) - ^ln(27r) - ^<A x Œt=2*t " 2a£[ =2 x t _ 1 x t + « 2 S[=2^ 2 -i)>q(0 (a) q(0 (A;c) q(0 (Xt _ 1:t) 
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+ ^(lnA y ) £?(0(Ay) -^ln(27r) -i(A y (2Liyt - 2bY 1 T t= 1 y t Xt + b 2 lLi* 2 )>q(0( b )q(0(A y )q©(* t ) 
~^|(« 2 - 2w a + ii a 2 ) q(i)(a) -^\n(2naï) 

-\n(r(a Àx ) ) - a Àx ln(ô A J + (a Ax - l)<lnA x ) q(i)(A;e) " ^U*> o (0 (;i;c) 

_ ^|^ 2 - 2b ^b + Mb 2 )q(0(b) - ^ln(27TC7 b 2 ) 

- In (r ( % )) - % ln (ô Ay ) + ( % - l) (lnA y ) q(i)(Ay) - ^<A y ) q (0 (Ay) 

(6.235) 

In the following, we dérive the remaining expectations to obtain an expression of the average energy in terms of the 
variational and prior parameters. Specifically, from Supplément 6.3.2.3 we have 

<*?>,»(*) = *àlJW ( ° + {fâvjf and <*!>,«>(*,) = ^?|y 1:T (6-236) 
From Supplément 2.7, we have 

OnWxVçy = (tf (o£) + 1» and (ln(A y )) q(0(Ay) = (> (a®) + ln b^) (6.237) 

We further have 

(a 2 - 2a/i a + Ma)q(0(a) = ( fl2 > ? (0(a) ~ 2 < a >q«(a)i"a + i"a = ( cr a W + (^a°) ) ~ 2 i"a' Va + Ma (6.238) 

and likewise 

(6 2 - 2fyx 6 + jug^cop) = (è 2 ) q (0 W - 2(b) q(i)(b)f i b +[i 2 b = (af } + (fi^) 2 ) ~ 2nfn b + n 2 b (6.239) 
Finally, we have 

<^xŒt=2^t _ 2aXt=2 x t-l x t + « 2 2t=2 x2 -l)>q(0( a )q(0 (A;c )q(0( Xt _ 1:t ) 
= <(A* Z[=2 x2 ))q«(a)q«(A x )q«(x t _ 1:t ) 

-<(2aA ac Z[=2 ^-l^))q®(a)q®(A x )q©(x t _ 1:t ) 

+ <0% St=2 ^-l))q©(a)q®(A x )q®(z t _ 1;t ) 
= ^ ) qW (A;c )2[=2^ 2 > £? (0 (Xt) 

- 2 < a \(0(a) <^>q»(A») Et=2<*t-l*t><i(x t _ l!t ) 
+ < a2 >q(0(a) Zt=2(^ 2 -l)q(x t _ 1 ) 
= 2t=2(^ 2 )q(0 (Xt) 
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_2 < a >q®(a) It=2<*t-l*t>qO t -i: t ) 
+ < a2 >q« (a) <^e>,CO(^) 2[= 2 (^ 2 -l>q(x t _ 1 ) 

_ ?l ,(i) fl W^(Q yr (=2© ,..(0 ..(0 ^ 

= «SX> (s- + (CJ 2 ) - a* «-^ + Xk J + («i (0 + W") 2 ) sa (<„,,"' + OCJ)) 

(6.240) 

and, following the same line of reasoning as in (6.240) 

UyŒLiy 2 - 2b£ï =1 y t x t + b 2 U=iX^) q (o {b)q (o^ y)q (o (Xt) 

= <«S (ff^ 2 - tfS-^ + h- (^B- & S J Ù ♦ fêj)) 

(6.241) 

Substitution of (6-236) - (6.245) into (6.235) then yields 

(\np(y 1:T , x 1:T , a, À x , b, ^ y )) q (i) (a)q (i) (Àx)q (i) Wq (i) (Ày)q (i) {Xl . T) 

- in (r(a A j) - a Xx In(^) + K - l) (tf (og) + In (*£)) - ^a«*« 

+ W°) 2 ) - 2^ +A*g) -|ln(2ira 6 2 ) 
- ln ( r (%)) - % ln <V + (% - 0 0 KO + 1" b%) - ^afb^ 

(6.246) 
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For the entropy term 



H(q®&)) = ^(q (0 (a)q»a,)q»(6)q»(A y )q» (Xl;r) ) (5 . 247) 



we first note that we can re-express this term using the additivity of differential entropy for independent variables 
(Cover and Thomas (2006), p. 253) as 

H(q®W)) = H (g® (a)) + H (g® (Aj) + H ((?«>(&)) + Jf (<7®(A y )) + Jf (<7® (*i : r)) 

(6.248) 

The first four terms in this sum correspond to the differential entropies of Gaussian and Gamma probability density 
functions, whose analytical solutions are readily available from the literature (cf. Supplément Section 2). Specifically, 
we have 



M (q® (a)) = h(n (a; M ®, <r a 2 ®)) = \\n (lue a 2 ®) 

H (g® (Aj) =x(g (A x ; a®, è®)) = a® + ln fc® + ln r (a®) + (l - a®) ^(a®) 
Jf (g® 00) = Jf (JV (b; nf, a, 2(0 )) = iln (zTre a|®) 

" (* W W) = *• (G fc® )) = «« + ln fc® + ln F (a®) + (l - a®) ^(a®) 



(6.249) 
(6.250) 
(6.251) 
(6.252) 



The last term in (2.248) can be evaluated by considering the factorization properties of q®(x 1:T ). From (6.3.2.3), we 
have 

<7®Oi:tO = Pgm(.x 1:T \y 1:T ) = Pg(oOilyi:r) IÏÏ=2 Pgco (*tl*t-iyi:r) = p$m(x 1 \y 1 . T ) n[=2 ^^^^ 



Substitution in (6.248) thus yields 

Further, with the additivity of differential entropy for independent variables, we have 

= (pg (0 (x 1 |y 1:T )) + ZL2 (pgro Ot-i< ac t |y 1:T )) - Zt=i H (p>) Ot-i lyi:r)) 
= 11=2^ (pgw(.x t _ lt x t \y 1:T i) - YJ=2^ (pg(o(^lyi:r)) 



(6.253) 



(6.254) 



(6.255) 



which, in parameterized form corresponds to 

^(V°(*i:r)) = SL 2 ^(^(^-i: t ;^ t _ 1:t |y 1:r - V 1:t |y 1:r ))-aiîf(%^^ ff )) 



97 



= iZ[= 2 ln((27r e ) 2 |£ Xt _ i:t| , i:T |) - ilL- 2 1 ln(27r e à^J 

= (T - 1) ln(27re) + Z[= 2 In IV 1:t |y 1:r I " ^ IL _ 2 In ^ 

(6.255) 

This complètes the dérivation. □ 

6.5 Posterior distribution transformation 

To estimate a posterior distribution over the latent SDE parameters a £ M. and a £ M+, we used the Euler- 
Maruyama approximation, which entails the use of two transformation mappings from a and a onto the LGSSM 
parameter a £ R and a 2 x £ R + , as discussed in Supplément Section 3. If we obtain posterior pdfs over a and al 
based on VB inference for the LGSSM and would like to use them for posterior inference over the SDE parameters a 
and a, we have to account for the transformation theorem for pdfs, as introduced below. 

o General univariate transformation theorem 

For a probability measure P on the Borel cr-field S with probability density function / for the distribution of the 
random variable x and a measurable transformation given by 

T:l-> E,ï h y := T{x) (6.256) 

a probability density function g for the distribution of the random variable y = T(x) is given by 

9{y)= vtFHà\ (6 - 257) 

where T' dénotes the first derivative and | • | the absolute value function. 

□ 

More specifically, for some aspects of the current scénario of interest, we have the 
o Transformation theorem for univariate linear- affine transformation 

For the univariate linear-affine transformation case of a probability density function / for the distribution of the 
random variable , we have with a, b £ M, a =é 0 

T:II^1,XH y := T(x) := ax + b (6.258) 

a probability density function g for the distribution of the random variable y = T(x) given by 

«M "Ci Vf?) (6.259, 



To transfer probabilistic statements over LGSSM parameters to probabilistic statements over latent SDE System 
parameters, we will employ the transformation theorems above for univariate transformations below. To this end, 
we first define the transformations from LGSSM parameters a,A x and A y to the latent SDE System parameters a, a 
and Oy. Based on the inverse relationships of thèse mappings established in Supplément Sections 3 and 6, we have 



la^T^a) =^(a-l) =:a (6.260) 
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with its inverse (cf. équation (3.53)) 

rf 1 : R -> M, a h> x (a) = (1 + aAt) =: a (6.261) 
As we have been working with the reciprocal of a x , given by A x = \ , we will dérive probabilistic statements on the 

a x 

posterior distribution of a based on 

T 2 :W + ^W + ,A x »T 2 a x )=j=-(T (6.262) 
with inverse (cf. équation (3.53) and the discussion following équation 23 of the tutorial) 

77 1 : R + -> R + ,a ~ T 2 \a) := ^ = :A* (6.263) 

Finally, we have for the transformation between A y and o- y (cf. the discussion following équation 23 of the tutorial) 

T 3 : R + -> R+, A y r 3 (A y ) = À' 1 =: ct^ (6.264) 

and inverse 

T3- 1 : R+ -» ^(ct^) = (c-2)- 1 = ff -2 =:Ay (6 2 6 5) 

Based on thèse transformations, the posterior probability density functions f lt f 2 and f 3 over a,A x and A y , 

.2 

y 

g x (d) = At • A (aAt + 1) ' (6.266) 



respectively, correspond to the following posterior probability density functions over a, a and a 2 



92^) = 2 4é^- < 6 - 267 > 

and 

93{tf) = iPyY 2 h ((tfy 2 )" 1 ) (6.2XX) 

o Proof of équation (6.266) 

The transformation of the LGSSM parameter a to the latent SDE System parameter a is given by 

r i: M -» M, a h» r x (a) = ^(a - 1) = - ^ (6.268) 

and the transformation theorem for univariate linear affine transformation applies. Substitution yields 

g ± (a) = At- f ± (aAt + ^) = At • A (aAt + 1) (6.269) 

□ 

o Proof of équation (6.267) 

The transformation of the LGSSM parameter A x to the latent SDE System parameter a is given by 

T 2 : R + -» k» T 2 (AJ = (A x At)4 (2.670) 

The derivative of T 2 at A x évaluâtes to 
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3 



--À y 2 At~2 

2 x 



(2.671) 



Substitution the above and T 2 1 (cr) = in (6.2xx) then yields 



-i((T 2 At)2At z 



-i(a 2 )2(At) 2 At-2 



=<x 3 At 



a 3 At 



(2.672) 



o Proof of équation (6.268) 

he transformation of the LGSSM parameter A y to the latent SDE System parameter Oy is given by 

T 3 :U + -» U+,À y h» 7 3 (A y ) = A" 1 (6.273) 

The derivative of T 3 at A y évaluâtes to 

n{X y ) = -^Ay 1 = -A' 2 (6.274) 
Substitution the above and T 3 -1 ((7y) = Oy 2 in (6.2xx) then yields 

* w - b^i - f^h - êPr ■ #r = fe2) " 2/3fe " 2) = w_2/3 ( fe2)_I ) (6 - 275) 



100 



6.6 Matlab implementation 



The following Matlab code was used to generate Figures 1, 6 and 7 of the tutorial. Note that this code comprises the 
Kalman filter and KRTS smoothing code depicted previously. 

function vb_lgssm_2 

% This function implements the variational Bayesian estimation of a 
% univariate LGSSM/discretized linear latent SDE as discussed in the 
% tutorial Section 5 and Supplément Section 6 . 

% Copyright (C) Dirk Ostwald 



clc 

close ail 



Random number generator settings 



% flag for reproducibility of the JMP figure 
isjmpfig = 1; 

% set rng 
if isjmpfig 

rng (2012873763) 

else 

rng ( 1 shuf f le 1 ) 

end 



True, but unknown, model spécification and sampling 



% System dimensionality 
k =1 
p = 1 

% SDE parameters 
delta_t = 1 
T =50 
alpha = -0.1 
sigma = 0.1 



% dimensionality of latent variable vector 
% dimensionality of observed variable vector 



time between observations, arbitrary units 
number of discrète time points 
SDE évolution parameter 
SDE latent noise parameter 



% true 
a 

sigsqr_x 
lambda_x 
b 

sigsqr_y 
lambda_y 
c 0 



but unknown, parameters 
= delta_t*alpha + 1 
sigma A 2*delta_t 
l/sigsqr_x 
-200 
0 .01 

l/sigsqr_y 
100 



LGSSM transition parameter 

LGSSM latent variable variance parameter 

LGSSM latent variable précision parameter 

LGSSM émission parameter 

LGSSM observation noise variance 

LGSSM latent variable précision parameter 

LGSSM additive constant for reactino time example 



% intialize hidden and observed data sample arrays 
x = NaN (k, T) ; 

y = NaN (p, T) ; 

% define intial state of the latent variable vector 
x(:,l) = -1; 

% obtain initial observed state 

y_rt(:,l) = mvnrnd ( (b* ( x ( : , 1 ) ) ) + c_0 , sigsqr_y ) ; 

% cycle over time points t = 2, . . ., T 
for t = 2:T 

% sample latent state 

x(:,t) = mvnrnd (a*x (:, t-1 ), sigsqr_x) ; 

% sample observed state 

y_rt(:,t) = mvnrnd ( (b* (x (:, t) )) + c_0 , sigsqr_y ) ; 

end 

y = y_rt - c_0 ; 
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Generative model formulation 



% prior distribution parameters 



mu 1 


= 


-1 




lambda 1 


= 


1000 




sigsqr 1 


= 


1/ lambda x ; 






10 




s a 




le3 




a lx 




le-1 




b~lx 




le3 




m b 




-190 




sjo 




le3 




a ly 




leO 




b_ly 




le2 




% concaténation 


of priors for variational free energy évaluation 


prior. mu 1 




mu 1 




prior. lambda 1 




lambda 1 ; 


prior. m a 




m a 




prior . s a 




s a 




prior . a lx 




a lx 




prior . b lx 




b~lx 




prior. m b 




m b 




prior. s b 




s_b 




prior. a ly 




a ly 




prior. b ly 




b_ly 










Parameter Space Initialization 


% specify linearly partitioned parameter spaces for inference 


amin 




-300 




amax 




300 




ares 




10e4 




aspace 




linspace (amin, amax, ares) ; 


lxmax 




200; 




lxmin 




0; 




lxres 




10e4; 


lxspace 




linspace (lxmin, lxmax, lxres); 


bmin 




-300 




bmax 




300 




bres 




10e4 




bspace 




linspace (bmin, bmax, bres) ; 


lymax 




200; 




lymin 




0; 




lyres 




10e4; 



lyspace 



linspace (lymin, lymax, lyres); 



Generative model estimation/inversion 



% maximum number of itérations 
max i =5; 



% initialize variational parameter arrays 



q_m_a 

q_s_a 

q_a_lx 

q_b_lx 

q_m_b 

q_s_b 

q_a_ly 

q_b_ly 

q mu x 

q_sig_x 

q sig xx 



NaN (max_i, 1) , 
NaN (max_i, 1) 
NaN (max_i, 1) 
NaN (max_i, 1) 
NaN (max_i, 1) 
NaN (max_i, 1) 
NaN (max_i, 1) 
NaN (max_i, 1) 
NaN (max_i, T) 
NaN (max_i, T) 
NaN (max i, T-l) , 



% initialize variational free energy array 
FE = NaN (max i, 1) ; 



initialize average energy and entropy component arrays 



n_ae 
n_h 
E 
H 



14 
7 

NaN ( n_ae , max_ 
NaN(n h, max i 



% number of average energy terms 
% number of entropy terms 
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% VB algorithm initiali zation 



% 


variational 


parameter 


q. 


m a(l) 


= m a; 


q_ 


"s a(l) 


= s a; 


q_ 


a lx (1) 


= a lx; 


q. 


~b lx(l) 


= b~lx; 


q_ 


"m b(l) 


= m b; 


q_ 


"s b(l) 


= sjo; 


q_ 


a ly(l) 


= a 1 y ; 


q. 


b ly(l) 


= b_ly; 



initialization for a,b 



% variational parameter initialization for x_(l:T) 



y_til 



a_til 

sigsqr x til 
b til F 



b til T 



sigsqr_y_til 



ty 

zéros (k, T) 
zéros (p, T) 
q_m_a ( 1 ) 

(q_a_lx(l) *q_b_lx(l) ) " (-1) 
[q_m_b(l) 

sqrt (q_s_a (1) *q_a_lx (1) *q_b_lx (1) ) 
sqrt (q_s_b (1) *q_a_ly (1) *q_b_ly (1) ) 
[q_m_b(l) 

0 

sqrt (q_s_b (1) *q_a_ly (1) *q_b_ly (1) ) 
[(q_a_ly(l)*q_b_ly(l)) A (-l) 0 0 
0 10 
0 0 1 



[mu_f_til, sigma_f_til] 
sigsqr_y_til) ; 

[mu_s_til, sigma_s_til, sigma_cs_til ] 



kalman_f ilter ( y_til , mu_l, sigsqr_l, a_til, sigsqr_x_til , b_til_t, b_til_T, 
krts_smoother (y, mu_f_til, a til, sigma_f_til, sigsqr_x_til) ; 



q_mu_x(l,:) = squeeze (mu_s_til) ; 

q_sig_x(l,:) = squeeze (sigma_s_til) 
q_sig_xx ( 1 , : ) = squee ze ( sigma_cs_til ) ; 



% concatenate variational parameters for variational free energy évaluation 



varia 


q. 


m a 




q. 


m a (1) 


varia 


q. 


s a 




q_ 


~s a(l) 


varia 


q. 


a lx 




q_ 


a lx (1) 


varia 


q. 


~b~lx 




q_ 


~b lx(l) 


varia 


q. 


m b 




q. 


"m b ( 1 ) 


varia 


q. 


~s~b 




q. 


~s b(l) 


varia 


q. 


a ly 




q. 


a ly(l) 


varia 


q. 


~b_ly 




q. 


b_ly(l) 


varia 


q. 


mu x 




q_ 


mu x ( 1 , : ) 


varia 


q. 


sig x 




q_ 


sig x ( 1 , : 


varia 


q. 


sig xx 




q_ 


sig xx ( 1 , 



% evaluate variational free energy 

vFE = vFreeEnergy (y, varia, prior) ; 

FE (1) = vFE.F; 

E ( : , 1 ) = vFE . E 1 ; 

H ( : , 1 ) = vFE . H 1 ; 



% VB algorithm itérations 



for i = 1 :max i-1 



% update of q(a) 



% \sum_(t=l) A (T-l) <x_t"2>_q (x_t) 

xtxt_qxt = sum(q_sig_x (i, 1 :T-1) , 2) + q_mu_x ( i , 1 : T-l ) *q_mu_x ( i , 1 : T-l ) ' ; 



% \sum_(t=l) A (T-l) <x_t,x_t+l>_q(x_t:t+l) 

xtxtt_qxt = sum(q_sig_xx (i, 1 :T-1) , 2) + q_mu_x ( i , 1 : T-l ) *q_mu_x ( i , 2 : T) ' ; 

% update équations 

q_s_a(i+l) = (lambda_x* (xtxt_qxt + (l/s_a) ) ) " (-1) ; 
q_m_a(i+l) = q_s_a ( i+1 ) * ( ( lambda_x*xtxtt_qxt ) + (m_a/s_a) ) ; 



% update of q(\lambda_x) 



% \sum_(t=2) A (T) <x_t' N 2>_q (x_t) 

xtxt_qxt_l = sum(q_sig_x (i, 2 :T) , 2) + q_mu_x ( i , 2 : T) *q_mu_x ( i , 2 : T) ' ; 
% \sum_(t=l) A (T-l) <x_t,x_t+l>_q(x_t:t+l) 

xtxtt_qxt = sum(q_sig_xx (i, 1 :T-1) , 2) + q_mu_x ( i , 1 : T-l ) *q_mu_x ( i , 2 : T) 1 ; 

% \sum_(t=l) A (T-l) <x_t"2>_q (x_t) 

xtxt_qxt_2 = sum(q_sig_x (i, 1 :T-1) , 2) + q_mu_x ( i , 1 : T-l ) *q_mu_x ( i , 1 : T-l ) 1 ; 

103 



% update équations 
q_a_lx(i+l) = (T/2) + a_lx; 

q_b_lx(i+l) = ((l/b_lx) + 1/2* (xtxt_qxt_l - 2 *q_m_a ( i+1 ) *xtxtt_qxt + (q_s_a(i+l) + 
q_m_a (i+1) "2) *xtxt_qxt_2) ) A (-1) ; 



% update of q(b) 



% \sum_(t=l) A (T) <x_t"2>_q (x_t) 

xtxt_qxt = sum(q_sig_x (i, 1 :T) , 2) + q_mu_x ( i , 1 : T) *q_mu_x ( i , 1 : T) ' ; 



% \sum_(t=l) A T y_t<x_t>_q (x_t) 
ytxt_qxt = y*q_mu_x(i, :) ' ; 



% update équations 

q_s_b(i+l) = (q_a_ly (i) *q_b_ly (i) *xtxt_qxt + ( 1 /s_b) ) " ( -1 ) 

q_m_b(i+l) = q_s_b ( i+1 ) * (q_a_ly ( i ) *q_b_ly ( i ) *ytxt_qxt +(m_b/s_b)); 

% update of q ( \lambda_y) 



% \sum_(t=l) y_t2 ) 
ytyt = y*y ' ; 

% \sum_(t=l) A T y_t<x_t>_q (x_t) 
ytxt_qxt = y*q_mu_x(i, :) ' ; 



% update équations 
q_a_ly(i+l) = (T/2) + a_ly; 

q_b_ly(i+l) = ((l/b_ly) + 1/2* (ytyt - 2 *q_m_b ( i+1 ) *ytxt_qxt + (q_s_b(i+l) + q_m_b ( i+1 ) "2 ) *xtxt_qxt ) ) A ( - 



update of q(x_(l:T)) 



y_til 



a_til 

sigsqr x til 
b til F 



b til T 



sigsqr_y_til 



[y 

zéros (k, T) 
zéros (p, T) 
q_m_a (i+1) 

(q_a_lx (i+1) *q_b_lx (i+1) ) " (-1) 
[q_m_b (i+1) 

sqrt (q_s_a (i+1) *q_a_lx (i+1) *qj 
sqrt (q_s_b (i+1) *q_a_ly (i+1) *qj 
[q_m_b (i+1) 
0 

sqrt (q_s_b (i+1) *q_a_ly (i+1) *qj 
[ (q_a_ly(i+l) *q_b_ly (i+1) ) " (-1) 
0 10 
0 0 1 



lx(i+l) 
ly (i+D 



ly (i+D 

0 0 



% q(x_(l:T)) parameter inference 
[mu_f_til, sigma_f_til ] 
b_til_t, b_til_T, sigsqr_y_til) ; 

[mu_s_til, sigma_s_til, sigma_cs_til] 
q_mu_x ( i+1 , : ) 
q_sig_x (i+1, : ) 
q_sig_xx ( i+1 , : ) 



kalman_f ilter (y_til, mu_l, sigsqr_l, a_til, sigsqr_x_til , 

krts_smoother ( y , mu_f_til, a_til, sigma_f_til, sigsqr_x_til ) 
squeeze (mu_s_til) 
squeeze ( sigma_s_til ) 
squeeze ( sigma_cs_til ) 



% concatenate 
varia . q_m_a 
varia .q_s_a 
varia. q a lx 
varia . q_b_lx 
varia . q_m_b 
varia .q_s_b 
varia. q a ly 
varia. q b ly 
varia. q mu x 
varia. q sig x 
varia. q sig xx 



variational parameters 
= q_m_a(i+l) 
= q_s_a(i+l) 
= q_a_lx(i+l) 
= q_b_lx(i+l) 
= q_m_b(i+l) 
= q_s_b(i+l) 
= q_a_ly(i+l) 
= q_b_ly(i+l) 
= q_mu_x ( i+1 , : ) 
= q_sig_x (i+1, : ) 
q_sig_xx ( i+1 , :) 



for variational free energy évaluation 



% evaluate variational free energy 

vFE = vFreeEnergy (y, varia, prior ) ; 

FE(i + l) = vFE.F; 

E ( : , i+1) = vFE.E' ; 

H ( : , i+1) = vFE.H' ; 



end 
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% plot perceptual learning example I 



h = figure; 

set (h, 'Color', [1 1 1]) 

plot([l:T], y_rt, ' LineStyle ' , ' — ', 'Color', 'k', 'LineWidth', 3,'Marker', 'o', ' MarkerFaceColor ' , 'k', 
'MarkerSize ' , 12) ; 

ylabel (' Reaction Time [ms] ' , ' FontName ' , 'Times New Roman', 'FontSize', 50) 

xlabel('Time [trials]' , 'FontName', 'Times New Roman', 'FontSize', 50) 

set(gca, 'FontSize', 50, 'FontName', 'Times New Roman', 'LineWidth', 3, 'XLim', [1 T] , 'YLim', [0 

300] , 'YColor' , 'k' , 'YTick',[100 200 300]) 

grid on 

maximize (h) 

% plot perceptual learning example II 



h = figure; 

set (h, 'Color', [1 1 1]) 
hold on 

[ax,hl,h2] = plotyy( [1:T] , y_rt, [1:T], q_mu_x ( 3 , : ) , 'plot'); 

set (get (ax ( 1 ),' Ylabel ' ) ,' String ',' Reaction Time [ms] ' , 'FontName', 'Times New Roman', 'FontSize', 50) 
set (get (ax (2 ),' Ylabel ' ) ,' String ',' Cognitive State[a.u.]' , 'FontName', 'Times New Roman', 'FontSize', 50) 
set(hl, 'LineStyle' , ' — ', 'Color', 'k', 'LineWidth', 3,'Marker', 'o', 'MarkerFaceColor', 'k', 'MarkerSize', 
12) 

set(h2, 'LineStyle' , ' — ', 'Color', 'r', 'LineWidth', 3,'Marker', 'o', 'MarkerFaceColor', 'r', 'MarkerSize', 
12) 

xlabel('Time [trials]', 'FontName', 'Times New Roman', 'FontSize', 50) 

set(ax(l), 'FontSize', 50, 'FontName', 'Times New Roman', 'LineWidth', 3, 'XLim', [1 T] , 'YLim', [0 
310] , 'YColor' , 'k' , 'YTick',[100 200 300]) 

set(ax(2), 'FontSize', 50, 'FontName', 'Times New Roman', 'LineWidth', 3, 'XLim', [1 T] , 'YLim', [-2 2 ], 
' YColor ' , ' r ' , ' YTick' ,-2:2) 

[ax,h3,h4] = plotyy ( [ 1 : T] , y_rt, [1:T], x , 'plot'); 

set (get (ax ( 1 ), 1 Ylabel ' ) ,' String ', 1 Reaction Time [ms] ' , 'FontName', 'Times New Roman', 'FontSize', 50) 
set (get (ax (2) ,' Ylabel ' ) ,' String ',' Cognitive State[a.u.]' , 'FontName', 'Times New Roman', 'FontSize', 50) 
set (h3, ' LineStyle ', ' — ', 'Color', 'k', 'LineWidth', 3,'Marker', 'o', 'MarkerFaceColor', 'k', 'MarkerSize', 
12) 

set (h4, 'LineStyle', ' — ', 'Color', 'k', 'LineWidth', 1,'Marker', 'o', 'MarkerFaceColor', 'w', 'MarkerSize', 
10) 

xlabel('Time [trials]', 'FontName', 'Times New Roman', 'FontSize', 50) 

set(ax(l), 'FontSize', 50, 'FontName', 'Times New Roman', 'LineWidth', 3, 'XLim', [1 T] , 'YLim', [0 
310] , 'YColor' , 'k' , 'YTick', [100 200 300]) 

set(ax(2), 'FontSize', 50, 'FontName', 'Times New Roman', 'LineWidth', 3, 'XLim', [1 T] , 'YLim', [-2 2 ], 
' YColor ' , ' r ' , ' YTick' ,-2:2) 
grid on 
maximize (h) 



% plot variational distribution updates 



h = figure; 

set(h, 'Color', [1 1 1]) 
iter = [1 2 5] ; 

% cycle over the first three itérations 
for i = 1:3 



% q(a) updates 
subplot (6, 3, i) 
hold on 

plot(aspace , pdf('Norm', aspace, q_m_a (iter (i) ) , sqrt (q_s_a (iter (i) ) ) ) , 'b', 'LineWidth', 2) 

plot(a ,0 , ' ro ' , 'MarkerFaceColor', 'r', 1 Markersize 1 , 10) 

plot (q_m_a (iter (i) ) ,0 , 'ko', 'MarkerFaceColor', 'w', 'Markersize', 10) 

set(gca, 'LineWidth', 1, 'FontName', 'Times New Roman', 'FontSize', 16) 

if i == 1 

xlim( [0 20] ) 

ylim([0 0.05]) 

else 

xlim( [0 2] ) 
ylim([0 14]) 

end 



% q(\lambda_x) updates 
subplot (6, 3, i+3) 
hold on 

plot(lxspace , pdf ( 'Gant' , lxspace, q_a_lx (iter (i) ) , q_b_lx (iter (i) ) ) , 'b', 'LineWidth', 2) 

plot ( lambda_x , 0 , ' ro ' , 'MarkerFaceColor', 'r', 'Markersize', 10) 

plot (q_a_lx (iter (i) ) *q_b_lx (iter (i) ) , 0 , 'ko', 'MarkerFaceColor', 'w', 'Markersize', 10) 

set(gca, 'LineWidth', 1, 'FontName', 'Times New Roman', 'FontSize', 16) 
ylim([0 0.05]) 
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% q(b) updates 
subplot (6, 3, i+6) 
hold on 

plot (bspace , pdf('Norra', bspace, q_m_b (iter (i) ) , sqrt (q_s_b (iter (i) ) ) ) , 'b', 'LineWidth', 2) 

plot (b ,0 , ' ro ' , ' MarkerFaceColor ' , 'r', ' Markersize ' , 10) 

plot (q_m_b (iter (i) ) ,0 , 'ko', 'MarkerFaceColor', 'w', 'Markersize', 10) 

set(gca, 'LineWidth', 1, ' FontName ' , 'Times New Roman', 'FontSize', 16) 

if i == 1 

xlim( [-220 -170] ) 

ylim([0 0.05]) 

else 

xlim( [-202 -198] ) 
ylim([0 12]) 

end 

% q(\lambda_y) updates 
subplot (6, 3, i+9) 
hold on 

plot(lyspace , pdf('Gam', lyspace, q_a_ly (iter (i) ) , q_b_ly (iter (i) ) ) , 'b', 'LineWidth', 2) 

plot ( lambda_y , 0 , ' ro ' , 'MarkerFaceColor', 'r', 'Markersize', 10) 

plot (q_a_ly (iter (i) ) *q_b_ly (iter (i) ) , 0 , 'ko', 'MarkerFaceColor', 'w', 'Markersize', 10) 
set(gca, 'LineWidth', 1, 'FontName', 'Times New Roman', 'FontSize', 16) 

% q(x_l:T) update 
subplot (6, 3, i+12) 
hold on 

plot([l:T], x ,' ro- ', 'MarkerSize ' , 5, 'MarkerFaceColor', 'r', 'LineWidth', 1) 

plot([l:T], q_mu_x (iter (i) , : ) ,' ko- ', 'MarkerSize ' , 5, 'MarkerFaceColor', 'w', 'LineWidth', 1) 

xlim( [ 0 T] ) 
ylim([-l 1]) 

set(gca, 'LineWidth', 1, 'FontName', 'Times New Roman', 'FontSize', 16) 

end 

% variational free energy 
subplot (6, 3, [16 18] ) 

plot ( [0 :max_i-l] , FE, 'ko-', 'MarkerFaceColor', 'w', 'LineWidth', 2, 'MarkerSize', 8) 
box off 

set(gca, 'LineWidth', 1, 'FontName', 'Times New Roman', 'FontSize', 16, 'xtick', [0:max_i-l]) 
xlim([-l max_i] ) 
maximize (h) 



% plot SDE posterior parameter distributions 



% transform posterior pdf over a to posterior pdf over\alpha 
alphaspace = linspace (-2, 2, 5000) ; 
p_alpha_pri = NaN (length (alphaspace ), 1 ) ; 
p_alpha_pos = NaN (length (alphaspace) , 1) ; 
for i = 1 : length (alphaspace ) 

p_alpha_pri (i) = delta_t*pdf ( ' Norm ' , (delta_t*alphaspace (i) ) + 1, m_a , sqrt(s_a)); 

p_alpha_pos (i) = delta_t*pdf ( 'Norm' , (delta_t*alphaspace (i) ) + 1, q_m_a(end), sqrt (q_s_a (end) )) ; 

end 

% transform posterior pdf over \lambda_x to posterior pdf over sigma 
sigmaspace = linspace (eps, . 3, 5000 ) ; 
p_sigma_pri = NaN (length ( sigmaspace ), 1 ) ; 
p_sigma_pos = NaN (length (sigmaspace) , 1) ; 

for i = 1 : length ( sigmaspace) 

p_sigma_pri (i) = ( 2 / (delta_t* ( sigmaspace ( i ) ~3 ))) *pdf (' Gam ' , 1 / (delta_t* ( sigmaspace ( i ) "2 )) , a_lx 
, b_lx) ; 

p_sigma_pos (i) = (2/ (delta_t* (sigmaspace (i) A 3) )) *pdf (' Gam' , 1/ (delta_t* (sigmaspace (i) A 2) ) , q_a_lx(end) 
, q_b_lx (end) ) ; 
end 

% no transformation necessary for b 
p_b_pri = pdf ('Norm', bspace, mjo, sqrt(s_b)); 
p_b_pos = pdf ('Norm', bspace, q_m_b(end), sqrt (q_s_b (end) )) ; 

% transform posterior pdf over \lambda_y to posterior pdf over sigma_y 
sigsqryspace = linspace (eps , 0 . 2 , 1 0 0 0 ) ; 
p_sigsqr_y_pri = NaN ( length ( sigsqryspace ), 1 ) ; 
p_sigsqr_y_pos = NaN ( length ( sigsqryspace) , 1 ) ; 

for i = 1 :length (sigsqryspace) 

p_sigsqr_y_pri (i) = ( sigsqryspace ( i ) " ( -2 )) *pdf (' Gam ' , ( sigsqryspace ( i ) " ( -1 )) , a_ly 

p_sigsqr_y_pos (i) = ( sigsqryspace ( i ) " ( -2 )) *pdf (' Gam ' , ( sigsqryspace ( i ) " ( -1 )) , q_a_ly(end) 

q_b_ly (end) ) ; 

end 



, b_ly ) ; 
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% plot the posterior distributions over SDE parameter 
h = figure; 

set(h, 'Color', [1 1 1]) 
subplot (2, 2, 1) 

[ax,hl,h2] = plotyy (alphaspace , p_alpha_pri, 
set (get (ax ( 1 ) , ' Ylabel ') , 1 String ' , 'Prior' 
set (get (ax ( 2 ) , ' Ylabel ') , 'String', 'Posterior 
set(hl, 'LineStyle' , '-' , 'LineWidth', 3) 
set(h2, 'LineStyle' , '-' , 'LineWidth', 3) 

xlabel ( ' \alpha ' , ' FontName ' , 'Times New Roman', 'FontSize 
set(ax(l), 'FontSize', 26, 'FontName', 'Times New Roman', 
set(ax(2), 'FontSize', 26, 'FontName', 'Times New Roman', 



alphaspace , p_ 
, ' FontName 
, ' FontName 



alpha_pos, 'plot'); 



' Times 
' Times 



New 
New 



. 40) 

1 LineWidth ' , 
'LineWidth' , 



Roman ' 
Roman ' 



'XLim' , 
'XLim' , 



1 FontSize ' 
' FontSize ' 



26) 
26) 



1] ) 
1] ) 



% plot the posterior distributions over SDE parameter sigma 
subplot (2, 2, 2) 

[ax,hl,h2] = plotyy ( sigmaspace , p_sigma_pri , sigmaspace , p 
set (get (ax ( 1 ) , ' Ylabel 1 ) , 1 String ' , 1 Prior ' , ' FontName 

set (get (ax ( 2 ),' Ylabel ' ) ,' String ',' Posterior ' , 'FontName 
set(hl, 'LineStyle' ,'-' , 'LineWidth', 3) 
set(h2, 'LineStyle' ,'-' , 'LineWidth', 3) 

xlabel (' \sigma ' , 'FontName', 'Times New Roman', 'FontSize', 40) 
set(ax(l), 'FontSize', 26, 'FontName', 'Times New Roman', 'LineWidth' 
sigmaspace (end) ] ) 

set(ax(2), 'FontSize', 26, 'FontName', 'Times New Roman', 'LineWidth' 
sigmaspace (end) ] ) 



sigma_pos, 'plot'); 
, ' Times New Roman ' , 
, ' Times New Roman ' , 



3, 
3, 



'XLim' 



'XLim' 



1 FontSize ' 
1 FontSize ' 



26) 
26) 



[sigmaspace (1) 
[sigmaspace (1) 



% plot the posterior distributions over parameter b 
subplot (2, 2, 3) 

[ax,hl,h2] = plotyy (bspace, p_b_pri, bspace , p_b 



set (get (ax (1) , 'Ylabel 
set (get (ax (2) , 'Ylabel 
set (hl , 'LineStyle', ' - 
set(h2, 'LineStyle' , ' - 
xlabel ('b', 'FontName 
set(ax(l), 'FontSize' 
set(ax(2), 'FontSize' 



pos, ' plot ' ) 
' FontName ' , 
' FontName ' , 



) , ' String ',' Prior ' 
) ,' String 1 ,' Posterior ' 
, 'LineWidth', 3) 
, 'LineWidth', 3) 

, 'Times New Roman', 'FontSize', 40) 
26, 'FontName', 'Times New Roman', 
26, 'FontName', 'Times New Roman', 



Times 
Times 



New 
New 



LineWidth' , 
LineWidth' , 



Roman ' 
Roman ' 



'XLim' , 
'XLim' , 



' FontSize ' 
1 FontSize ' 



26) 
26) 



-210 
-210 



-190] ) 
-190] ) 



% plot the posterior distributions over parameter \sigma_y' N 2 
subplot (2, 2, 4) 

[ax,hl,h2] = plotyy (sigsqryspace 



set (get (ax (1) , 'Ylabel 
set (get (ax (2) , 'Ylabel 
set(hl, 'LineStyle' , '- 
set(h2, 'LineStyle' , '- 
xlabel ( ' \sigma_y~2 ' , 
set(ax(l), 'FontSize' 
sigsqryspace (end) ] ) 
set(ax(2), 'FontSize' 
sigsqryspace (end) ] ) 
maximize (h) 



p_sigsqr_y_pri , 
) , ' String ',' Prior ' , 
) ,' String ', 1 Posterior ' , 
, 'LineWidth', 3) 
, 'LineWidth', 3) 
FontName ' , ' Times New Roman ' , 



sigsqryspace , p_ 
FontName', 'Times 
FontName ' , ' Times 



1 FontSize ' 



40) 



sigsqr_y_pos 
New Roman ' , 
New Roman ' , 



26, 'FontName' 



26, 'FontName' 



'Times New Roman' 



'Times New Roman' 



'LineWidth', 3, 'XLim' 



'LineWidth', 3, 'XLim' 



'plot' ) ; 
FontSize ' , 
FontSize ' , 



26) 
26) 



[sigsqryspace (1) 
[sigsqryspace (1) 



% display alpha crédible interval 



% numeric intégration scheme for the evaalution of a crédible interval 

delta_alpha = diff (alphaspace) ; 

cdf_alpha = NaN ( 1 , length (delta_alpha) ) 

cdf_alpha(l) = 0 

for i = 2 : length ( cdf_alpha) 

cdf_alpha(i) = cdf_alpha (i-1) + (delta_alpha (i-1) * ( (P_alpha_pos (i) + p_alpha_pos (i-1) ) /2) ) ; 

end 

% find first index with p(\alpha) > 0.025 
lower_idx = 1; 

while cdf_alpha (lower_idx) < 0.025 
lower idx = lower idx + 1; 

end 

% find last index with p(\alpha) < 0.975 
upper idx = 1; 

while cdf_alpha (upper_idx) < 0.975 
upper idx = upper idx + 1; 

end 

upper idx = upper idx - 1; 

disp ( [num2str (cdf_alpha (upper_idx) - cdf_alpha ( lower_idx) , 2 ) 1 crédible interval: [', ... 
num2str (alphaspace (lower_idx) , 2) , ' , ' num2str (alphaspace (upper_idx) , 2) , ' ] ' ] ) 

% display sigma crédible interval 



% numeric intégration scheme for the evaalution of a crédible interval 
delta_sigma = diff (sigmaspace) ; 
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cdf_sigma = NaN ( 1 , length (delta_sigma) ) ; 
cdf_sigma(l) = 0 ; 
for i = 2 : length ( cdf_sigma) 

cdf_sigma (i) = cdf_sigma (i-1) + (delta_sigma (i-1) * ( (p_sigma_pos (i) + p_sigma_pos (i-1) ) 12) ) ; 

end 



% find first index with p(\sigma) > 0.025 
lower idx = 1; 

while cdf_sigma (lower_idx) < 0.025 
lower_idx = lower_idx + 1; 

end 

% find last index with p(\sigma) < 0.975 
upper_idx = 1; 

while cdf_sigma (upper_idx) < 0.975 
upper_idx = upper_idx + 1; 

end 

upper idx = upper idx - 1; 

disp ( [num2str (cdf_sigma (upper_idx) - cdf_sigma ( lower_idx) , 2 ) ' crédible interval: [', 
num2str ( sigmaspace ( lower_idx) , 2 ) , ', ' num2str ( sigmaspace (upper_idx) , 2 ) , ']']) 

end 



function [mu_f , Sigma_f ] = kalman_f ilter ( y , mu_x_l , Sigma_x_l , A, Sigma_x, B_t , B_T, Sigma_y ) 

% This function implements the forward pass for inference in a linear 

% Gaussian state space model, also known as the Kalman Filter. It 

% capitalizes on the expressions for the conditional expectation and 

% covariance as discussed in section 5 of the tutorial . Specif ically, for 

% the covariance update, Joseph' s symmetrized form is employed. 



Inputs 





y 


P 


x 


T 


data array 




mu x 1 


k 


x 


i 


initial latent variable expectation 




Sigma x 1 


k 


x 


k 


initial latent variable covariance 




A 


k 


X 


k 


transition matrix 




Sigma x 


k 


X 


k 


latent variable noise covariance 




B t 


P 


X 


k 


émission matrix for t = 1, . . . ,T-1 




B T 


P 


X 


k 


émission matrix for t = T 




Sigma y 


P 


X 


P 


latent variable noise covariance 




Outputs 












mu f 


k 


X 


T 


filtered mean of p(x t|y (l:t)) 




Sigma f 


k 


X 


k 


x T filtered covariance of p(x t|y 



% Copyright (C) Dirk Ostwald 



% recover System and data dimensionalities 
k = size (A, 1) ; 

p = size (y, 1) 

T = size (y, 2) 

% initialize conditional expectation and covariance arrays 
mu_f = NaN (k, T) ; 

Sigma_f = NaN(k,k,T); 



% Initialization 



mu_f(:,l) = mu_x_l + Sigma_x_l*B ' *inv (B*Sigma_x_l*B ' + Sigma_y) * (y ( : , 1 ) - B*mu_x_l), 

Sigma_f ( : , : , 1) = Sigma_x_l - Sigma_x_l*B ' *inv (B*Sigma_x_l*B ' + Sigma_y) *B*Sigma_x_l ; 

% filtered covariance symmetry check 

if ~isequal ( Sigma_f ( : , : , 1 ) , Sigma_f ( : , : , 1 ) ' ) 

Sigma_f (:, :,1) = 1/2* (Sigma_f ( : , : , 1) + Sigma_f ( : , : , 1 ) ' ) ; 

end 

% Recursion 

% cycle over remaining time-points 
for t = 2:T 



% case distinction for augmented sysmtems 
if t == T 

B = B_T; 

else 

B = B_t; 

end 
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% predictor and Kalman gain matrices 

P = A*Sigma_f ( : , : , t-1) *A' + Sigma_x; 

K = P*B ' *inv (B*P*B ' + Sigma_y) ; 

% filtered expectation and covariance updates, Joseph 's form 
mu_f(:,t) = A*mu_f ( : , t-1) + K*(y(:,t) - B*A*mu_f ( : , t-1 ) ) ; 

Sigma_f ( : , : , t) =(eye(k) - K*B) *P* (eye (k) - K*B) ' + K*Sigma_y*K ' ; 

% filtered covariance symmetry check 

if -isequal (Sigma_f ( : , : , t) , Sigma_f ( : , : , t) ' ) 

Sigma_f ( : , : , t) = 1/2* (Sigma_f ( : , : , t) + Sigma_f ( : , : , t) ' ) ; 

end 

end 
end 



function [mu_s, Sigma_s, Sigma_cs] 



krts_smoother ( y , mu f, A, Sigma_f, Sigma_ 



% This function implements the backward pass for inference in a linear 

% Gaussian state space model, also known as the Kalman-Rauch-Tung-Striebel 

% smoother, and here in addition returning the cross-moment covariance of 

% p(x_t,x_(t+l) |y_(l:T) ) 



Inputs 





y 


P 


X 


T 


data array 




mu f 


k 


X 


T 


filtered latent variable expectation 




A 


k 


X 


k 


transition matrix 




Sigma f 


k 


X 


k 


filtered latent variable covariance 




Sigma x 


k 


X 


k 


latent variable noise covariance 



Outputs 



mu_s 

Sigma_s 

Sigma_cs 



k x 1 smoothed mean of p (x_t I y_ ( 1 : T) ) 

k x k x T filtered covariance of p (x_t I y_ ( 1 : T) ) 
k x k x T-1 filtered cross moment covariance of 
p(x t,x (t+1) |y (1:T) ) for t = 1,...,T 



% Copyright (C) Dirk Ostwald 



% recover System and data dimensionalities 
k = size (A, 1) ; 

p = size (y, 1 ) ; 

T = size (y, 2) ; 



% initialize conditional expectation, 
% covariance arrays 
mu_s = NaN (k, T) ; 

Sigma_s = NaN(k,k,T); 

Sigma_cs = NaN (k, k, T-1) ; 



covariance and cross moment 



filter initialization 



mu_s ( : , T ) 
Sigma_s ( : , 



■ T) 



mu_f ( : , T) ; 
Sigma_f ( : , : , T) ; 



% filter recursion 



for t = T-1 : -1 : 1 

% smoothing matrix J 

J = Sigma_f ( : , : , t) *A' *inv (A*Sigma_f ( : , : , t) *A' + Sigma_x) ; 

% smoothed expectation and covariance updates 

mu_s(:,t) = mu_f(:,t) + J* (mu_s ( : , t+1 ) - A*mu_f ( : , t) ) ; 

Sigma_s ( : , : ,t) = Sigma_f ( : , : , t) + J* ( Sigma_s ( : , : , t+1 ) - Sigma_f ( : , : , t+1 ) 

% smoothed covariance symmetry check 

if -isequal (Sigma_s ( : , : , t) , Sigma_s ( : , : , t) ' ) 

Sigma_s ( : , : , t) = 1/2* (Sigma_s ( : , : , t) + Sigma_s ( : , : , t) ' ) ; 

end 



% cross moment 
Sigma_cs ( : , : , t) 



J*Sigma_s ( : , : ,t+l) ; 



end 
end 
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function [vFE] = vFreeEnergy (y, varia, prior) 



% This function évaluâtes the variational free energy of the tutorial 
% example generative model and its variational approximation 



Inputs 



y 

varia 
prior 



1 x T array of observed variable values 
structure comprising the variational parameters 
structure comprising the prior parameters 



Outputs 



vFE 



: variational free energy 
% Copyright (C) Dirk Ostwald 



% extract 


variational and prior parameters 


q m a 




varia 


q m a 


q s a 




varia 


q s a 


q a lx 




varia 


q a lx 


q b lx 




varia 


q b lx 


q m b 




varia 


q m b 


q s b 




varia 


q s b 


q_a_ly 




varia 


q_a_ly ; 


q_b_ly 




varia 


q_b_ly ; 


q mu x 




varia 


q mu x ; 


q sig x 




varia 


q sig x ; 


q sig xx 




varia 


q sig xx ; 



m_a 

s_a 

a_lx 

b~lx 

m_b 

sjo 

a_ly 

b_ly 

mu_l 

lambda 1 



prior . m_a 
prior . s_a 
prior . a lx 
prior. b lx 
prior .m_b 
prior . s_b 
prior . a_ly 
prior. b ly 
prior. mu 1 
prior . lambda_l 



% evaluate further constants 

T = size (y, 2) 

xTxT_qxT = sum (q_sig_x ( 2 : T) , 2 ) 

xtxtt_qxt = sum (q_sig_xx ( 1 : T-l ) , 2 ) 

<x_t , x_t+l >_q ( x_t : t+1 ) 

xtxt_qxt = sum (q_sig_x ( 1 : T-l ) , 2 ) 

ytyt = y*y' 

ytxt_qxt = y (2 : T) *q_mu_x (2 :T) ' 
xTTxTT_qxTT = sum (q_sig_x ( 1 : T) , 2 ) 



q_mu_x (2 :T) *q_mu_x (2 :T) 1 
q_mu_x (1 :T-1) *q_mu_x (2 :T) ' 



+ q_mu_x (1 :T-1) *q_mu_x (1 :T-1) 



+ q_mu_x ( 1 : T) *q_mu_x ( 1 : T) 1 



\sum_(t=2) A (T) <x_t"2>_q(x_t) 
\sum (t=l) A (T-l) 



\sum_(t=l) 
\sum_(t=l) 
\sum_(t=2) 
\sum (t=l) 



(T-l) <x_t"2>_q(x_t) 
(T) y_t"2 

(T) y_t*<x_t>_q(x_t) 
(T) <x_t"2>_q(x_t) 



% initialize (x_t-l,x_t) distribution covariance matrix 
log_det_Sigma_xtxtt = NaN(l,T-l); 
for t = 2:T 



% assemble covariance matrices and evaluate their déterminants 

log_det_Sigma_xtxtt (t-l) = log (det ( [q_sig_x (t-l) q_sig_xx (t-l) ;q_sig_xx (t-l) q_sig_x (t) ] ) ) ; 



end 



% evaluate free energy "average energy" terms 

vFE.E(l) = +0.5*log(lambda_l/ (2*pi) ) ; 

vFE.E(2) = -0 .5*lambda_l* ( (q_sig_x (1) + q_mu_x(l)"2) - ( 2 *q_mu_x ( 1 ) *mu_l ) + (mu_l A 2) ) ; 

vFE.E(3) = +0 . 5*T* (psi (q_a_lx) + log(q_b_lx)) - 0 . 5*T*log (2*pi) ; 

vFE.E(4) = -0 . 5* (q_a_lx*q_b_lx) * (xTxT_qxT - (2*q_m_a*xtxtt_qxt) +((q_s_a + q_m_a~2 ) *xtxt_qxt) ) ; 

vFE.E(5) = +0 . 5*T* (psi (q_a_ly) + log(q_b_ly)) - 0 . 5*T*log (2*pi) ; 

vFE.E(6) = -0 . 5* (q_a_ly*q_b_ly) * (ytyt - (2*q_m_b*ytxt_qxt) + ( (q_s_b + q_m_b' N 2 ) *xTTxTT_qxTT) ) ; 

vFE.E(7) = - (0 .5/s_a) * ( (q_s_a + q_m_a^2) - 2*q_m_a*m_a + m_a~2); 

vFE.E(8) = -0.5*log(2*pi*s_a) ; 

vFE.E(9) = -gammaln (a_lx) - (a_lx*log (b_lx) ) ; 

vFE.E(lO) = +(a_lx - 1) * (psi (q_a_lx) + log(q_b_lx)) - ( (q_a_lx*q_b_lx) /b_lx) ; 

vFE.E(ll) = - (0 .5/s_b) * ( (q_s_b + q_m_b"2) - 2*q_m_b*m_b + m_b"2); 

vFE.E(12) = -0.5*log(2*pi*s_b) ; 

vFE.E(13) = -gammaln (a_ly) - (a_ly*log (b_ly) ) ; 

vFE.E(14) = +(a_ly - 1 ) * (psi (q_a_ly) + log(q_b_ly)) - ( (q_a_ly*q_b_ly) /b_ly) ; 
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% evaluate the free energy "entropy" terms 

vFE.H(l) = 0 . 5*log (2*pi*exp (1) *q_s_a) ; 

vFE.H(2) = q_a_lx + log(q_b_lx) + gammaln (q_a_lx) + (1 

vFE.H(3) = 0T5*log (2*pi*expTl) *q_s_b) ; 

vFE.H(4) = q_a_ly + log(q_b_ly) + gammaln (q_a_ly) + (1 

vFE.H(5) = (T-l) *log (2*pi*exp (1) ) ; 

vFE.H(6) = sum (log_det_Sigma_xtxtt) ; 

vFE.H(7) = - ( (T-l)72) *sum(log(q_sig_x (2 :T-1) ) ) ; 

% evaluate free energy 

vFE.F = sum(vFE.E) + sum(vFE.H); 

end 



function maximize ( f ig) 

% This function maximizes the figure with handle fig 
% Input 

% fig : figure handle of figure to be maximized 

% Original author: Bill Finger, Creare Inc. 

% Free for redistribution as long as crédit comments are preserved. 
% Copyright (C) Dirk Ostwald 



if nargin == 0 
fig = gcf; 

end 

units = get ( f ig, ' units ' ) ; 

set (fig, 'units', ' normal ized ' , ' outerposition ' , [ 0 0 1 1 ] ) ; 
set (fig, ' units 1 , units) ; 

end 



- q_a_lx) *psi (q_a_lx) ; 

- q_a_ly) *psi (q_a_ly) ; 
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